Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574048
A scalable parallel workstation cluster system
C. Dong, Weimin Zheng, Dingxing Wang, M. Shen
In this paper, we argue that recent advances in network and CPU technologies have positioned workstation clusters to become the primary parallel computing infrastructure for science and engineering. After analyzing and comparing the communication performance of three popular networks (10 Mbps Ethernet, 100 Mbps Ethernet and 640 Mbps Myrinet) on an experimental workstation cluster, we identify two main factors that hinder the wider adoption of workstation clusters: the low efficiency of the communication system (both hardware and software) and the lack of a friendly parallel program development environment with supporting tools. To address these two problems, we implemented two workstation cluster systems targeting different performance/price requirements: one with 8 PowerPCs on a shared-media network, the other with 8 Sun SPARCstations on a switched network. Using a Reduced Communication Protocol (RCP), we dramatically improved the performance of the communication system; by extending the language support of PVM and adding several useful tools, we built IPCE, a visual integrated parallel program development environment. On our platform we also analyzed several large applications, including the GRI benchmark, an earthquake simulator, weather forecasting and some NAS benchmarks, and obtained very good results for these coarse- to medium-grain applications: speedup ranges from 5.83 to 7.98 and parallel efficiency reaches 72.88%-99.7%.
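For reference, the quoted efficiency figures follow from the usual definition E = S/p on the 8-node configurations described above; a minimal check in C (the node count and speedups are the ones stated in the abstract):

```c
/* Quick check of the quoted parallel efficiencies, using the standard
 * definition E = S / p on the 8-node clusters described above. */
#include <stdio.h>

int main(void)
{
    const int p = 8;                           /* nodes per cluster         */
    const double speedups[] = { 5.83, 7.98 };  /* reported speedup range    */

    for (int i = 0; i < (int)(sizeof speedups / sizeof speedups[0]); i++) {
        double efficiency = speedups[i] / p;   /* E = S / p                 */
        printf("speedup %.2f on %d nodes -> efficiency %.2f%%\n",
               speedups[i], p, 100.0 * efficiency);
    }
    return 0;
}
/* Prints 72.88% and 99.75%; the upper figure is quoted as 99.7% above. */
```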
{"title":"A scalable parallel workstation cluster system","authors":"C. Dong, Weimin Zheng, Dingxing Wang, M. Shen","doi":"10.1109/APDC.1997.574048","DOIUrl":"https://doi.org/10.1109/APDC.1997.574048","url":null,"abstract":"In this paper, we argue that because of recent advance of network & CPU technologies, workstation clusters are poised to become the primary parallel computing infrastructure for science and engineering computing. After analyzing and comparing the communication performance of three popular networks: 10 Mbps Ethernet, 100 Mbps Ethernet and 640 Mbps Myrinet on an experimental workstation cluster, we point out that two main factors hinder the wider application of workstation cluster: low efficiency of communication system (both hardware and software) and lack of friendly parallel program development environment with accessory tools. For these two problem, we implemented two workstation cluster systems for different performance/price rate requirements: one is 8 PowerPCs with shared media network, another is 8 Sun Sparcstations with switch network. By using Reduced Communication Protocol (RCP), we dramatically improved the performance of communication system; by expanding the language support of PVM and adding several useful tools, we build a visual integrated parallel program development environment IPCE. On our platform, we also analyzed several massive applications, such as GRI benchmark, earthquake simulator, weather forecasting and some NAS benchmarks, and we get very good results for these coarse-grain to middle-grain applications. The speedup ranges from 5.83 to 7.98 and parallel efficiency reaches to 72.88%-99.7%.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122477444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574025
Parallel solver of generalized eigenproblem on Dawning-1000
Xue-bin Chi
In this paper, we consider the parallel solution of the generalized eigenproblem for Hermitian matrices on the Dawning-1000. The problem arises from the theoretical analysis of nonlinear optical crystal structures. We use Cholesky factorisation, Householder transformation, the bisection method and inverse iteration to complete the computation. The implementation is based on the BLAS library and the communication function library provided on the Dawning-1000. The numerical results show very good performance, and the application in physics is satisfactory.
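The abstract does not spell the reduction out, but the listed steps correspond to the standard two-stage approach: a Cholesky factorisation of the (Hermitian positive definite) B turns the generalized problem into a standard Hermitian one, which Householder tridiagonalization, bisection and inverse iteration then solve:

```latex
% Generalized Hermitian eigenproblem A x = \lambda B x with B Hermitian
% positive definite.  Cholesky factorization B = L L^{H} reduces it to a
% standard Hermitian eigenproblem in y = L^{H} x.
\begin{align}
  A x &= \lambda B x, \qquad B = L L^{H}, \\
  \bigl(L^{-1} A L^{-H}\bigr)\, y &= \lambda y, \qquad y = L^{H} x .
\end{align}
```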
{"title":"Parallel solver of generalized eigenproblem on Dawning-1000","authors":"Xue-bin Chi","doi":"10.1109/APDC.1997.574025","DOIUrl":"https://doi.org/10.1109/APDC.1997.574025","url":null,"abstract":"In this paper, we consider the parallel implementation of solving generalized eigenproblem of Hermitian type matrices on Dawning-1000. It arises from the theoretical analysis of nonlinear optical crystal structures. We use Cholesky factorisation, Househoulder transformation, bisection method and inverse iteration to complete the computation. The implementation is based on the BLAS library and communication function library provided on Dawning-1000. The numerical results show very good performance and the application in physics is satisfactory.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129609165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574014
A simulation research on multiprocessor interconnection networks with wormhole routing
Yueming Hu
In designing a parallel computer system, selecting an appropriate interconnection network is an important issue. This paper presents simulation results on the performance of message-passing interconnection networks commonly used in multiprocessor systems. Comparisons are made among various interconnection networks with wormhole routing, including the crossbar, mesh, hypercube, tree and hypertree. The performance factors compared are network throughput and message delay. To obtain a more general model for tree-structured networks, the paper presents the definition of the m-fold n-ary tree, an extension of the hypertree network.
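As background for the delay comparison, the first-order latency model commonly used for wormhole routing (a textbook formula, not one taken from this paper) is:

```latex
% First-order latency of a wormhole-routed message: per-hop routing delay
% for the header flit plus pipelined transmission of the message body.
% D = number of hops, t_r = per-hop routing/switching time,
% L = message length in flits, t_f = time to forward one flit per channel.
\begin{equation}
  T_{\mathrm{wormhole}} \approx D\, t_r + L\, t_f
\end{equation}
```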
{"title":"A simulation research on multiprocessor interconnection networks with wormhole routing","authors":"Yueming Hu","doi":"10.1109/APDC.1997.574014","DOIUrl":"https://doi.org/10.1109/APDC.1997.574014","url":null,"abstract":"To design a parallel computer system, selecting an appropriate network is an important issue. This paper presents the simulation results on the performance of message passing interconnection networks used commonly in multiprocessor systems. Comparisons have been made on the performance of various interconnection networks like crossbar, mesh, hypercube, tree and hypertree with wormhole routing. The performance factors compared include the throughput of these networks and message delay. To make a more general model for tree structured network, this paper present the definition of m-fold n-ary tree, which is the extension of the hypertree network.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129176599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574022
Parallel recursive algorithm for tridiagonal systems
Yuguang Huang
In this paper, a parallel algorithm based on recurrence for solving tridiagonal systems is presented. Compared with the parallel prefix (PP) method, which is also based on recurrence, the computation cost is reduced by a factor of two while the communication cost remains the same. The method can be viewed as a modified prefix method, or prefix with substructuring. The complexity of the algorithm is analysed using the BSP (Bulk Synchronous Parallel) model. Experimental results are obtained on a Sun workstation using the Oxford BSP Library.
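For reference, the standard BSP cost model used in such analyses charges each superstep on p processors as follows (g is the communication gap per word, l the barrier synchronization cost):

```latex
% Cost of one BSP superstep: the slowest local computation, plus g times
% the largest number h of words any processor sends or receives, plus the
% barrier synchronization cost l.
\begin{equation}
  T_{\mathrm{superstep}} = \max_{0 \le i < p} w_i \;+\; g\,h \;+\; l,
  \qquad h = \max_{0 \le i < p} \max\bigl( h_i^{\mathrm{out}}, h_i^{\mathrm{in}} \bigr)
\end{equation}
```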
{"title":"Parallel recursive algorithm for tridiagonal systems","authors":"Yuguang Huang","doi":"10.1109/APDC.1997.574022","DOIUrl":"https://doi.org/10.1109/APDC.1997.574022","url":null,"abstract":"In this paper, a parallel algorithm for solving tridiagonal equations based on recurrence is presented. Compared with the parallel prefix method (PP) which is also based on the recursive method, the computation cost is reduced by a factor of two while maintaining the same communication cost. The method can be viewed as a modified prefix method or prefix with substructuring. The complexity of the algorithm is analysed using the BSP model (Bulk Synchronous Parallel). Experimental results are obtained on a Sun workstation using the Oxford BSP Library.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"2012 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114625008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574036
Study and design of scalable memory-shared multiprocessing system
Chengqing Ye, Zhonghai Wu, Changsheng Yang
This paper proposes the design of SMMP, a scalable memory-shared multiprocessing system that supports the client/server mode. The SMMP system is composed of a two-level interconnection network, a three-level memory subsystem and a three-level I/O subsystem. The design of SMMP has many advantages: it is scalable, easy to implement and operate, general purpose, and provides large I/O throughput. It can serve as an excellent server for high-speed communication networks.
{"title":"Study and design of scalable memory-shared multiprocessing system","authors":"Chengqing Ye, Zhonghai Wu, Changsheng Yang","doi":"10.1109/APDC.1997.574036","DOIUrl":"https://doi.org/10.1109/APDC.1997.574036","url":null,"abstract":"This paper proposes the design of a scalable memory-shared multiprocessing system SMMP which supports Client/Server mode. SMMP system is composed of two-level interconnection networks, three-level memory subsystem and three-level I/O subsystem. There are many advantages in the design of our SMMP, such as scalable, easy to implement and operate, general purpose and large I/O throughput. It can be an excellent server for high-speed communication network.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124473343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574041
Implementation of efficient and reliable multicast servers
W. Jia, Chan-Hee Lee, X. Jia, Jiannong Cao
Reliable multicast services within a group of autonomous distributed processes/sites are desirable for maintaining a consistent state of shared information accessed by transactions in distributed systems. Many existing protocols are complicated and thus quite expensive, and are not efficient with respect to the availability of distributed systems. This paper discusses the design and implementation of a new logical-token-ring-based multicast communication service. It provides total ordering, atomicity of multicast messages, membership, and fault-tolerant services in the presence of site fail-stop and network partitioning. A unique feature of the protocol is that all members in the group know exactly who holds the token and are therefore able to determine the correct order of a multicast message, reducing synchronization overhead, preventing possible token-loss problems and minimizing control messages. The services are implemented using a finite state machine approach and are highly efficient compared with related services in the same network settings.
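A minimal sketch of the sequencing idea, assuming (as is typical for logical-token-ring protocols, and not detailed in the abstract) that the token carries the next global sequence number. Membership, fault tolerance and the real network code are omitted; all names are illustrative:

```c
/* Single-process sketch of token-based total ordering: the token carries
 * the next global sequence number, so every member delivers multicasts in
 * the same order.  This is only an illustration of the ordering idea. */
#include <stdio.h>

#define NPROC 4

struct token   { int holder; int next_seq; };
struct message { int seq; int sender; const char *payload; };

int main(void)
{
    struct token tok = { 0, 0 };
    const char *pending[NPROC] = { "m0", NULL, "m2", "m3" }; /* one multicast each */
    struct message delivery_log[NPROC];
    int logged = 0;

    /* Rotate the token around the logical ring once; only the current
     * holder may stamp and send, which is what yields the total order. */
    for (int round = 0; round < NPROC; round++) {
        int p = tok.holder;
        if (pending[p]) {
            delivery_log[logged].seq = tok.next_seq++;
            delivery_log[logged].sender = p;
            delivery_log[logged].payload = pending[p];
            logged++;
        }
        tok.holder = (tok.holder + 1) % NPROC;   /* pass the token */
    }

    /* Every member delivers in increasing sequence number -> same order. */
    for (int i = 0; i < logged; i++)
        printf("deliver seq=%d from P%d: %s\n",
               delivery_log[i].seq, delivery_log[i].sender,
               delivery_log[i].payload);
    return 0;
}
```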
{"title":"Implementation of efficient and reliable multicast servers","authors":"W. Jia, Chan-Hee Lee, X. Jia, Jiannong Cao","doi":"10.1109/APDC.1997.574041","DOIUrl":"https://doi.org/10.1109/APDC.1997.574041","url":null,"abstract":"Reliable multicast services in a group of autonomous distributed processes/sites are desirable to maintain the consistent state of shared information accessed by transactions in distributed systems. Many existing protocols are complicated and thus quite expensive and not efficient for availability of distributed systems. This paper discusses the design and implementations of a new logical token ring based multicast communications services. It provides total ordering, atomicity of multicast messages membership and fault-tolerant services in the presence of sites fail stop and network partitioning. An unique feature of the protocol is that all members, knowing exactly, in the group, who holds the token, are able to detect right order of a multicast message, thereby, reducing the synchronous overhead, preventing possible token loss problems and minimizing control messages. The services are implemented by using finite state machine approach and they are highly efficient comparing with related services in the same network settings.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115761794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574024
Parallel matrix computations and their applications for biomagnetic fields
V. Zerbe, Harald Keller, G. Schorcht
In this paper we present the results of a parallel implementation of a heart field simulation algorithm. The analysis of biomagnetic fields offers a wide range of applications for parallel algorithms. Pathological changes in the human body, especially in the heart muscle, can be diagnosed and localised by means of biomagnetic field parameters. The benefit of this diagnostic method is the ability to fit an individual reference model of a patient's heart field. Based on differences between the reference model and the actually measured biomagnetic field parameters, the type and position of defects in the heart can be located. The most time-consuming components of the whole algorithm are the matrix computations, especially the matrix inversion, which can be implemented on a parallel distributed-memory system. In this paper we discuss the routing, the parallel matrix inversion, and the speedup for different network topologies as a function of the number of processors and the problem size.
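The abstract does not say which inversion scheme was parallelized; as an illustration only, a serial Gauss-Jordan inversion is sketched below, with comments marking the row-elimination loop that a distributed-memory version would typically partition across processors:

```c
/* Serial Gauss-Jordan inversion of an N-by-N matrix, augmented with the
 * identity.  Sketch only: no pivoting, illustrative problem size. */
#include <stdio.h>
#include <math.h>

#define N 3

int invert(double a[N][N], double inv[N][N])
{
    for (int i = 0; i < N; i++)                 /* inv starts as identity  */
        for (int j = 0; j < N; j++)
            inv[i][j] = (i == j) ? 1.0 : 0.0;

    for (int k = 0; k < N; k++) {
        double pivot = a[k][k];
        if (fabs(pivot) < 1e-12) return -1;     /* (near-)singular         */

        for (int j = 0; j < N; j++) {           /* scale pivot row         */
            a[k][j]   /= pivot;
            inv[k][j] /= pivot;
        }
        /* Eliminate column k from every other row.  In a distributed
         * version, each processor applies this step to its own block of
         * rows after the pivot row has been broadcast. */
        for (int i = 0; i < N; i++) {
            if (i == k) continue;
            double f = a[i][k];
            for (int j = 0; j < N; j++) {
                a[i][j]   -= f * a[k][j];
                inv[i][j] -= f * inv[k][j];
            }
        }
    }
    return 0;
}

int main(void)
{
    double a[N][N] = { {4, 7, 2}, {3, 6, 1}, {2, 5, 3} };
    double inv[N][N];
    if (invert(a, inv) == 0)
        for (int i = 0; i < N; i++)
            printf("%8.4f %8.4f %8.4f\n", inv[i][0], inv[i][1], inv[i][2]);
    return 0;
}
```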
{"title":"Parallel matrix computations and their applications for biomagnetic fields","authors":"V. Zerbe, Harald Keller, G. Schorcht","doi":"10.1109/APDC.1997.574024","DOIUrl":"https://doi.org/10.1109/APDC.1997.574024","url":null,"abstract":"In this paper we present the results of a parallel implementation of a heart field simulation algorithm. The application of biomagnetic fields offers a wide range for using parallel algorithms. Pathological changes in the human body, especially in the heart muscle, can be diagnosed and localised by means of biomagnetic field parameters. The benefit of this diagnosis method is to fit an individual reference model of the heart field of a patient. Based on differences between the reference model and the real measured biomagnetic field parameters, the type and the position of defects in the heart can be located. The most time consuming components of the whole algorithm are the matrix computations, especially the matrix inversion. The matrix inversion can be implemented on a parallel distributed memory system. In this paper we discuss the routing, the parallel matrix inversion, and the speed up for different network topologies that depends on the number of processors and different problem sizes.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134477873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574047
GPR-Tree: a global parallel index structure for multiattribute declustering on cluster of workstations
Xiaodong Fu, Dingxing Wang, Weimin Zheng
The R-tree is a very popular dynamic access structure capable of storing multidimensional and spatial data. Given its merits of efficient global balance and dynamic reorganization, we use the R-tree to decluster multiattribute data in database systems or file systems. Since many previous multiattribute declustering mechanisms do not take the properties of a Cluster of Workstations (COW) into account, we present the Global Parallel R-tree (GPR-Tree) for the COW architecture. We first examine the efficiency issues of the R-tree and its variants, and enhance R-tree efficiency by using heuristic information in the reconstruction of the R-tree during node splitting and in the treatment of the orphan entries of underfilled nodes. We then parallelize the improved R-tree across the components of the system. The basic idea is to alleviate the bottleneck effect of the I/O subsystem by exploiting high-speed network communication and memory. The GPR-Tree is shared among the processing units (PUs) of the system. We use a mixed LRU algorithm to schedule pages so that frequently visited nodes stay in memory, and a write-update-like protocol to keep the multiple copies maintained in the system coherent. This mechanism is shown to be effective in improving the scalability and performance of the system.
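The abstract does not specify the "mixed LRU" policy or the write-update protocol; the sketch below shows only the plain-LRU part of keeping frequently visited nodes resident, with illustrative names and sizes:

```c
/* Tiny LRU buffer of R-tree node ids: a hit refreshes the timestamp, a
 * miss replaces the least-recently-used (or an empty) slot.  Sketch only. */
#include <stdio.h>

#define SLOTS 4

struct slot { int node_id; long last_use; int valid; };

static struct slot cache[SLOTS];
static long tick = 0;

static void touch(int node_id)
{
    tick++;
    for (int i = 0; i < SLOTS; i++) {
        if (cache[i].valid && cache[i].node_id == node_id) {
            cache[i].last_use = tick;                 /* hit: refresh      */
            printf("hit  node %d (slot %d)\n", node_id, i);
            return;
        }
    }
    int victim = 0;                                   /* miss: pick victim */
    for (int i = 1; i < SLOTS; i++) {
        if (!cache[victim].valid) break;              /* empty slot found  */
        if (!cache[i].valid || cache[i].last_use < cache[victim].last_use)
            victim = i;
    }
    printf("miss node %d -> slot %d (evicts node %d)\n",
           node_id, victim, cache[victim].valid ? cache[victim].node_id : -1);
    cache[victim].node_id = node_id;
    cache[victim].last_use = tick;
    cache[victim].valid = 1;
}

int main(void)
{
    int visits[] = { 1, 2, 3, 1, 4, 5, 1, 2 };        /* node 1 stays hot  */
    for (int i = 0; i < (int)(sizeof visits / sizeof visits[0]); i++)
        touch(visits[i]);
    return 0;
}
```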
{"title":"GPR-Tree: a global parallel index structure for multiattribute declustering on cluster of workstations","authors":"Xiaodong Fu, Dingxing Wang, Weimin Zheng","doi":"10.1109/APDC.1997.574047","DOIUrl":"https://doi.org/10.1109/APDC.1997.574047","url":null,"abstract":"R-tree is a very popular dynamic access structure cable of storing multidimensional and spatial data. Considering it's merit of the efficient global balance and dynamic reorganization, we try to use R-tree to decluster the multiattribute data in database system or file system. As many previous multiattribute declustering mechanisms do not take into account the properties of the Cluster of Workstations (COW), we present the Global Parallel R-tree (GPR-Tree) under the architecture of COW. Firstly we inspect the issues in efficiency of R-tree and it's variants, we try to enhance the R-Tree efficiency by using heuristics information in the reconstruction of R-Tree during the node splitting and the treatment of the orphan entries of the underfilled node. Then we parallelize the improved R-Tree among the components in the system. The basic thought is to alleviate the bottleneck effect of the I/O subsystem, making use of the high speed network communication and the memory. The GPR-Tree is shared among the processing units (PU) of the system. We use a mixed LRU algorithm to schedule pages in memory to maintain the nodes visited frequently in memory. A write-update-like protocol is used to keep the coherency among multiple copies maintained in the system. This mechanism is proved efficient to improve the salability and performance of the system.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126914746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574008
Parallel VLSI neural system design for time-delay speech recognition computing
D. Zhang
Neural systems, as processors of time-sequence patterns, have been successfully applied to several speaker-dependent speech recognition tasks, and they can be implemented efficiently by a pipelined architecture. In this paper, parallel time-delay speech recognition computing for VLSI neural systems is presented. The design methodology emphasizes coordination between the computational model, the architectural description, and the VLSI systolic implementation. Examples of applying time-delay speech recognition to VLSI neural system design and performance analysis are given to illustrate the effectiveness of the parallel computation.
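As background, the basic time-delay unit assumed by such recognizers computes each output from a short window of delayed inputs (essentially a 1-D convolution followed by a nonlinearity); the paper's systolic mapping is not reproduced here, and the weights and sizes below are illustrative only:

```c
/* Sketch of a time-delay unit: y[t] = tanh( sum_d w[d] * x[t-d] + bias ).
 * In a systolic/pipelined realization, each tap is one stage. */
#include <stdio.h>
#include <math.h>

#define T 8         /* length of the input sequence     */
#define D 3         /* number of taps (delays) per unit */

int main(void)
{
    double x[T] = { 0.1, 0.5, 0.9, 0.4, -0.2, -0.6, -0.1, 0.3 };
    double w[D] = { 0.5, 0.3, 0.2 };      /* tap weights (illustrative)    */
    double bias = 0.0;

    for (int t = D - 1; t < T; t++) {     /* valid outputs start at t=D-1  */
        double s = bias;
        for (int d = 0; d < D; d++)
            s += w[d] * x[t - d];
        printf("y[%d] = %.4f\n", t, tanh(s));
    }
    return 0;
}
```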
{"title":"Parallel VLSI neural system design for time-delay speech recognition computing","authors":"D. Zhang","doi":"10.1109/APDC.1997.574008","DOIUrl":"https://doi.org/10.1109/APDC.1997.574008","url":null,"abstract":"Neural system, as processors of time-sequence patterns, have been successfully applied to several speaker-dependent speech recognition computing. They can be efficiently implemented by a pipelined architecture. In this paper, parallel time-delay speech recognition computing for VLSI neural systems is presented. The system design methodology is to emphasize coordination between computational model, architectural description, and VLSI systolic implementation. Examples of time-delay speech recognition applications to VLSI neural system design and performance analysis are given to illustrate effectiveness of the parallel computation.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130583006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1997-03-19 | DOI: 10.1109/APDC.1997.574062
Parallel processing on traditional serial programs by huge node data flow
Siwei Luo, Anfeng Huang, Yaping Huang
This paper introduces an algorithm that generates huge-node data flow by compiling existing programs. The purpose of the algorithm is to improve the speed of parallel processing while making use of the large body of existing program resources. In addition, the idea of the huge-node data flow algorithm can also be applied to distributed processing and multi-threaded processing.
{"title":"Parallel processing on traditional serial programs by huge node data flow","authors":"Siwei Luo, Anfeng Huang, Yaping Huang","doi":"10.1109/APDC.1997.574062","DOIUrl":"https://doi.org/10.1109/APDC.1997.574062","url":null,"abstract":"This paper introduces an algorithm that can generate huge node data flow by compiling existing programs. The purpose of this algorithm is to improve the speed of parallel processing and utilize the large amount of existing program resources. In addition, this idea of huge node data flow algorithm can also be used in distributed processing and multi-thread processing.","PeriodicalId":413925,"journal":{"name":"Proceedings. Advances in Parallel and Distributed Computing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130214952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}