Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472188
S. Mahapatra, R. Mahapatra
This paper presents a mapping scheme for parallel pipelined execution of the Backpropagation Learning Algorithm on distributed memory multiprocessors (DMMs). The proposed implementation exhibits training set parallelism that involves batch updating. Simple algorithms have been presented, which allow the data transfer involved in both forward and backward executions phases of the backpropagation algorithm to be carried out with a small communication overhead. The effectiveness of our mapping has been illustrated, by estimating the speedup of a proposed implementation on an array of T-805 transputers.<>
{"title":"Mapping of backpropagation learning onto distributed memory multiprocessors","authors":"S. Mahapatra, R. Mahapatra","doi":"10.1109/ICAPP.1995.472188","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472188","url":null,"abstract":"This paper presents a mapping scheme for parallel pipelined execution of the Backpropagation Learning Algorithm on distributed memory multiprocessors (DMMs). The proposed implementation exhibits training set parallelism that involves batch updating. Simple algorithms have been presented, which allow the data transfer involved in both forward and backward executions phases of the backpropagation algorithm to be carried out with a small communication overhead. The effectiveness of our mapping has been illustrated, by estimating the speedup of a proposed implementation on an array of T-805 transputers.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130344975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472271
Bin Zhou, Mohammed Atiquzzaman
Multistage Interconnection Networks (MIN) are used to connect processors and memories in large scale scalable multiprocessor systems. MINs have also been proposed as switching fabrics in ATM networks in the future Broadband ISDN networks. A MIN consists of several stages of small crossbar switching elements (SE). Buffers are used in the SEs to increase the throughput of the MIN and prevent internal loss of packets. Different buffering schemes for the SEs are discussed in this paper. The objective of this paper is to study the performance of MINs with different buffering schemes, in the presence of uniform and hot spot traffic patterns. The results obtained from the study will help the network designers in choosing appropriate buffering strategies for MINs. For comparing different buffering strategies, the throughput and packet delay have been used as the performance measures.<>
{"title":"A performance comparison of buffering schemes for multistage switches","authors":"Bin Zhou, Mohammed Atiquzzaman","doi":"10.1109/ICAPP.1995.472271","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472271","url":null,"abstract":"Multistage Interconnection Networks (MIN) are used to connect processors and memories in large scale scalable multiprocessor systems. MINs have also been proposed as switching fabrics in ATM networks in the future Broadband ISDN networks. A MIN consists of several stages of small crossbar switching elements (SE). Buffers are used in the SEs to increase the throughput of the MIN and prevent internal loss of packets. Different buffering schemes for the SEs are discussed in this paper. The objective of this paper is to study the performance of MINs with different buffering schemes, in the presence of uniform and hot spot traffic patterns. The results obtained from the study will help the network designers in choosing appropriate buffering strategies for MINs. For comparing different buffering strategies, the throughput and packet delay have been used as the performance measures.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132613045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472207
A. Roberts, A. Symvonis
In this paper, we consider the deflection worm routing problem on two dimensional n/spl times/n meshes. Our results include: (i) an off-line algorithm for routing permutations in O(kn) steps, and (ii) a general method to obtain deflection worm routing algorithms from packet routing algorithms.<>
{"title":"On deflection worm routing on meshes","authors":"A. Roberts, A. Symvonis","doi":"10.1109/ICAPP.1995.472207","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472207","url":null,"abstract":"In this paper, we consider the deflection worm routing problem on two dimensional n/spl times/n meshes. Our results include: (i) an off-line algorithm for routing permutations in O(kn) steps, and (ii) a general method to obtain deflection worm routing algorithms from packet routing algorithms.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114075009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472212
R. Neogi
DCT/IDCT bared source coding and decoding techniques are widely accepted in HDTV systems and other MPEG based applications. In this paper, we propose a new direct 2-D IDCT algorithm bared on the parallel divide-and-conquer approach. The algorithm distributes computation by considering one transformed coefficient at a time and doing partial computation and updating as every coefficient arrives. A novel parallel and fully pipelined architecture with an effective processing time of one cycle per pixel for an N/spl times/N size block is designed to implement the algorithm. An unique feature of this architecture is that it integrates inverse-shuffling, inverse-quantization, inverse-source-coding, and motion-compensation into a single compact data-path. We avoid the insertion of a FIFO between the bit-stream decoder and decompression engine. The entire block of pixel values are sampled in a single cycle for post-processing after de-compression. Also, we use only (N/2(N/2+1))/2 multipliers and N/sup 2/ adders.<>
{"title":"Embedded real-time video decompression algorithm and architecture for HDTV applications","authors":"R. Neogi","doi":"10.1109/ICAPP.1995.472212","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472212","url":null,"abstract":"DCT/IDCT bared source coding and decoding techniques are widely accepted in HDTV systems and other MPEG based applications. In this paper, we propose a new direct 2-D IDCT algorithm bared on the parallel divide-and-conquer approach. The algorithm distributes computation by considering one transformed coefficient at a time and doing partial computation and updating as every coefficient arrives. A novel parallel and fully pipelined architecture with an effective processing time of one cycle per pixel for an N/spl times/N size block is designed to implement the algorithm. An unique feature of this architecture is that it integrates inverse-shuffling, inverse-quantization, inverse-source-coding, and motion-compensation into a single compact data-path. We avoid the insertion of a FIFO between the bit-stream decoder and decompression engine. The entire block of pixel values are sampled in a single cycle for post-processing after de-compression. Also, we use only (N/2(N/2+1))/2 multipliers and N/sup 2/ adders.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122309485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472199
W. Cheng, X. Jia, M. Werner
A number of multicast algorithms have been proposed to guarantee potential causal ordering. In these mechanisms, the delivery of a message would be blocked waiting for the delivery of causally earlier messages, even though it may not actually have a causal relationship with these messages. This represents a higher latency cost than necessary. Our objective is to reduce the latency time of multicast message delivery. This can be achieved by reducing the number of messages that a multicast message has to wait for. We propose a mechanism in which the delivery of a message would only be blocked by the delivery of messages with which it has an actual causal relationship. The mechanism includes causality information, supplied by users, in the multicast messages. Receivers deliver messages to application processes according to this information. We introduce a programming construction, message blocks, to simplify the task of expressing causality. Simulation results are included and discussed in detail.<>
{"title":"A multicast mechanism for actual causal ordering","authors":"W. Cheng, X. Jia, M. Werner","doi":"10.1109/ICAPP.1995.472199","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472199","url":null,"abstract":"A number of multicast algorithms have been proposed to guarantee potential causal ordering. In these mechanisms, the delivery of a message would be blocked waiting for the delivery of causally earlier messages, even though it may not actually have a causal relationship with these messages. This represents a higher latency cost than necessary. Our objective is to reduce the latency time of multicast message delivery. This can be achieved by reducing the number of messages that a multicast message has to wait for. We propose a mechanism in which the delivery of a message would only be blocked by the delivery of messages with which it has an actual causal relationship. The mechanism includes causality information, supplied by users, in the multicast messages. Receivers deliver messages to application processes according to this information. We introduce a programming construction, message blocks, to simplify the task of expressing causality. Simulation results are included and discussed in detail.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"319 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127390011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472270
M. Saleh, Mohammed Atiquzzaman
Multistage interconnection networks based on shared buffering are known to have better performance and buffer utilization than input or output buffered switches. Shared buffer switches do not suffer from head of line blocking which is a common problem in simple input buffering. Shared buffer switches have previously been studied under uniform and unbalanced traffic patterns. However, due to the complexity of the model, the performance of such a network, in the presence of a single hot spot, has not been fully explored. A hot spot arises when one of the outputs of the network becomes very popular. We develop a model for a multistage interconnection network constructed from shared buffer switching elements and operating under a hot spot traffic pattern. The model is validated by comparison with simulation results. The model is used to study the network performance in terms of the throughput, packet delay, packet loss probability and the optimal buffer utilization, Numerical results show that, in the presence of hot spot traffic, shared buffer switches degrade more significantly than switches with dedicated input and/or output buffers.<>
{"title":"Analysis of shared buffer multistage networks with hot spot","authors":"M. Saleh, Mohammed Atiquzzaman","doi":"10.1109/ICAPP.1995.472270","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472270","url":null,"abstract":"Multistage interconnection networks based on shared buffering are known to have better performance and buffer utilization than input or output buffered switches. Shared buffer switches do not suffer from head of line blocking which is a common problem in simple input buffering. Shared buffer switches have previously been studied under uniform and unbalanced traffic patterns. However, due to the complexity of the model, the performance of such a network, in the presence of a single hot spot, has not been fully explored. A hot spot arises when one of the outputs of the network becomes very popular. We develop a model for a multistage interconnection network constructed from shared buffer switching elements and operating under a hot spot traffic pattern. The model is validated by comparison with simulation results. The model is used to study the network performance in terms of the throughput, packet delay, packet loss probability and the optimal buffer utilization, Numerical results show that, in the presence of hot spot traffic, shared buffer switches degrade more significantly than switches with dedicated input and/or output buffers.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127732435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472196
Z. Leyk, M. Dow
We are implementing iterative methods on the VPP500 parallel computer. During this process we have met with different kind of problems. It is easy to notice that performance on the VPP500 depends critically on the type of matrices taken for computations. In sparse computations, it is important to take advantage of the structure of the matrix. There can be a big difference between the performance obtained from a matrix stored in the diagonal format and one stored in a more general format. Therefore it is necessary to choose an appropriate format for a matrix used in computations. Preliminary tests show that implementation of the package is scalable with respect to the number of processors, especially for large problems. It is becoming clear for us that the traditional efficient preconditioning techniques result only in a speedup of factor 2 at best. We need to look for new preconditioners more adjusted for parallel computations. The polynomial preconditioning approach is attractive because of the negligible preprocessing cost involved. We favour the reverse communication interface for added flexibility necessary for doing tests with different storage formats and preconditioners. We can conclude that it is crucial to experiment with existing parallel machines to better understand the effects that are difficult to derive from theory, such as impact of communication costs or ways of storing data.<>
{"title":"Package of iterative solvers for the Fujitsu VPP500 parallel supercomputer","authors":"Z. Leyk, M. Dow","doi":"10.1109/ICAPP.1995.472196","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472196","url":null,"abstract":"We are implementing iterative methods on the VPP500 parallel computer. During this process we have met with different kind of problems. It is easy to notice that performance on the VPP500 depends critically on the type of matrices taken for computations. In sparse computations, it is important to take advantage of the structure of the matrix. There can be a big difference between the performance obtained from a matrix stored in the diagonal format and one stored in a more general format. Therefore it is necessary to choose an appropriate format for a matrix used in computations. Preliminary tests show that implementation of the package is scalable with respect to the number of processors, especially for large problems. It is becoming clear for us that the traditional efficient preconditioning techniques result only in a speedup of factor 2 at best. We need to look for new preconditioners more adjusted for parallel computations. The polynomial preconditioning approach is attractive because of the negligible preprocessing cost involved. We favour the reverse communication interface for added flexibility necessary for doing tests with different storage formats and preconditioners. We can conclude that it is crucial to experiment with existing parallel machines to better understand the effects that are difficult to derive from theory, such as impact of communication costs or ways of storing data.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116743448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472211
P. Cremonesi, M. Pugassi, N. Scarabottolo
The problem considered in this paper is the implementation of motion-detection on distributed-memory MIMD machines. The solution here proposed is based on pyramidal algorithms that, by iteratively discarding uninteresting details, allow to focus on the moving parts of an image stream. Different parallelisation methodologies have been evaluated and the most promising ones have been implemented on a Transputer-based parallel machine. Experimental results are here presented and compared with the theoretical ones.<>
{"title":"Motion detection on distributed-memory machines: a case study","authors":"P. Cremonesi, M. Pugassi, N. Scarabottolo","doi":"10.1109/ICAPP.1995.472211","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472211","url":null,"abstract":"The problem considered in this paper is the implementation of motion-detection on distributed-memory MIMD machines. The solution here proposed is based on pyramidal algorithms that, by iteratively discarding uninteresting details, allow to focus on the moving parts of an image stream. Different parallelisation methodologies have been evaluated and the most promising ones have been implemented on a Transputer-based parallel machine. Experimental results are here presented and compared with the theoretical ones.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114950596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472217
K. W. Chong, Tak-Wah Lam
Finding the connected components of a graph is a basic computational problem. In recent years, there were several exciting results in breaking the log/sup 2/ n-time barrier to finding connected components on parallel machines using shared memory without concurrent-write capability. This paper further presents two new parallel algorithms both using less than log/sup 2/ n time. The merit of the first algorithm is that it uses only a sublinear number of processors, yet retains the time complexity of the fastest existing algorithm. The second algorithm is slightly slower but its work (i.e., the time-processor product) is closer to optimal than all previous algorithms using less than log/sup 2/ n time.<>
寻找图的连通分量是一个基本的计算问题。近年来,有几个令人兴奋的结果打破了log/sup 2/ n-time的障碍,在使用共享内存而没有并发写能力的并行机器上找到连接的组件。本文进一步提出了两种新的并行算法,时间均小于log/sup 2/ n。第一种算法的优点是它只使用亚线性数量的处理器,但保留了现有最快算法的时间复杂度。第二种算法稍微慢一点,但它的工作(即时间处理器乘积)比之前的所有算法更接近于最优,所用时间小于log/sup 2/ n
{"title":"Improved parallel algorithms for finding connected components","authors":"K. W. Chong, Tak-Wah Lam","doi":"10.1109/ICAPP.1995.472217","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472217","url":null,"abstract":"Finding the connected components of a graph is a basic computational problem. In recent years, there were several exciting results in breaking the log/sup 2/ n-time barrier to finding connected components on parallel machines using shared memory without concurrent-write capability. This paper further presents two new parallel algorithms both using less than log/sup 2/ n time. The merit of the first algorithm is that it uses only a sublinear number of processors, yet retains the time complexity of the fastest existing algorithm. The second algorithm is slightly slower but its work (i.e., the time-processor product) is closer to optimal than all previous algorithms using less than log/sup 2/ n time.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130251059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1995-04-19DOI: 10.1109/ICAPP.1995.472172
G. A. Papadopoulos
Term Graph Rewriting Systems (TGRS) have been used extensively as an implementation vehicle for a number of, often divergent, programming paradigms ranging from the traditional functional programming ones to the (concurrent) logic programming ones and various amalgamations of them, to (concurrent) object-oriented ones. More recently, the relationship between TGRS and process calculi (such as the /spl pi/-calculus) as well as Linear Logic has also been explored. In this paper we describe our experience in using an intermediate Compiler Target Language (CTL) based on TGRS for mapping a variety of programming paradigms of the aforementioned types onto it, highlighting in the process some of the issues which we feel any such intermediate representation should address and which form effectively a minimum set of features every CTL should possess.<>
{"title":"Essential features of a compiler target language for parallel machines","authors":"G. A. Papadopoulos","doi":"10.1109/ICAPP.1995.472172","DOIUrl":"https://doi.org/10.1109/ICAPP.1995.472172","url":null,"abstract":"Term Graph Rewriting Systems (TGRS) have been used extensively as an implementation vehicle for a number of, often divergent, programming paradigms ranging from the traditional functional programming ones to the (concurrent) logic programming ones and various amalgamations of them, to (concurrent) object-oriented ones. More recently, the relationship between TGRS and process calculi (such as the /spl pi/-calculus) as well as Linear Logic has also been explored. In this paper we describe our experience in using an intermediate Compiler Target Language (CTL) based on TGRS for mapping a variety of programming paradigms of the aforementioned types onto it, highlighting in the process some of the issues which we feel any such intermediate representation should address and which form effectively a minimum set of features every CTL should possess.<<ETX>>","PeriodicalId":448130,"journal":{"name":"Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130789411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}