Heterogeneous computing opens up new challenges and opportunities in fields such as parallel and distributed processing, design of algorithms for applications, scheduling of parallel tasks, interconnection network technology, and support for reliable distributed heterogeneous computing. A trend in supporting fault-tolerance in distributed computing systems is to incorporate fault-tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. We present an approach for developing efficient, reliable distributed applications for heterogeneous computing systems. We propose a library prototype, called H-Libra, that supports fault-tolerance in heterogeneous systems at low run-time cost. Fault-tolerance is based on distributed consistent checkpointing and rollback-recovery, integrated with a user-level network communication protocol. Novel mechanisms keep to a minimum the communication overhead of taking a consistent distributed checkpoint and of capturing messages in transit during a checkpoint. By providing fault-tolerance transparency and a simple, easy-to-use, high-level message-passing interface, H-Libra simplifies the development of reliable heterogeneous distributed applications.
{"title":"Supporting fault-tolerance in heterogeneous distributed applications","authors":"P. Maheshwari, J. Ouyang","doi":"10.1109/HCW.1997.581421","DOIUrl":"https://doi.org/10.1109/HCW.1997.581421","url":null,"abstract":"Heterogeneous computing opens up new challenges and opportunities in fields such as parallel and distributed processing, design of algorithms for applications, scheduling of parallel tasks, interconnection network technology and support for reliable distributed heterogeneous computing. A trend of supporting fault-tolerance in distributed computing systems is to incorporate fault-tolerance into applications at low cost, in terms of both run time performance and programming effort required to construct reliable application software. We present an approach for developing efficient reliable distributed applications for heterogeneous computing systems. We propose a library prototype, called H-Libra, to support fault-tolerance in heterogeneous systems with low run-time cost. Fault-tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level network communication protocol. By employing novel mechanisms, minimum communication overhead is involved for taking a consistent distributed checkpoint and catching messages in transit during a checkpoint. By providing fault-tolerance transparency and a simple, easy to use high-level message-passing interface, H-Libra simplifies the development of reliable heterogeneous distributed applications.","PeriodicalId":286909,"journal":{"name":"Proceedings Sixth Heterogeneous Computing Workshop (HCW'97)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116828782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A stochastic Petri net (SPN) is systematically constructed from a task graph whose component subtasks are statically allocated onto the processor suite of a heterogeneous computing system (HCS). Given that subtask execution times are exponentially distributed, a distribution of the overall completion time can be generated. In particular, the enabling functions and rate functions used to specify the SPN model provide the versatility needed to integrate processor heterogeneity, task priorities, allocation schemes, communication costs, and other factors characteristic of an HCS into a comprehensive performance analysis. The manner in which these parameters are incorporated into the SPN allows the model to be transformed into a testbed for optimization schemes and heuristics. The proposed approach can be applied to arbitrary task graphs, including non-series-parallel graphs.
{"title":"Stochastic Petri nets applied to the performance evaluation of static task allocations in heterogeneous computing environments","authors":"A. McSpadden, N. Lopez-Benitez","doi":"10.1109/HCW.1997.581420","DOIUrl":"https://doi.org/10.1109/HCW.1997.581420","url":null,"abstract":"A stochastic Petri net (SPN) is systematically constructed from a task graph whose component subtasks are statically allocated onto the processor suite of a heterogeneous computing system (HCS). Given that subtask execution times are exponentially distributed an exponential distribution can be generated for the overall completion time. In particular the enabling functions and rate functions used to specify the SPN model provide needed versatility to integrate processor heterogeneity, task priorities, allocation schemes, communication costs, and other factors characteristic of a HCS into a comprehensive performance analysis. The manner in which these parameters are incorporated into the SPN allows the model to be transformed into a testbed for optimization schemes and heuristics. The proposed approach can be applied to arbitrary task graphs including non-series-parallel.","PeriodicalId":286909,"journal":{"name":"Proceedings Sixth Heterogeneous Computing Workshop (HCW'97)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127446764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The different types of messages used by a parallel application program executing in a distributed system can each have unique characteristics, so that no single communication network can produce the lowest latency for all messages. For instance, short control messages may be sent with the lowest overhead on one type of network, such as Ethernet, while bulk data transfers may be better suited to a different type of network, such as Fibre Channel or HiPPI. In this paper, we investigate how to exploit multiple heterogeneous communication networks that interconnect the same set of processing nodes by dynamically selecting the best (lowest latency) network for each message based on the message size. We also show how to aggregate these multiple parallel networks into a single virtual network to further reduce the latency and increase the available bandwidth. We test this multiplexing and aggregation on a cluster of SGI multiprocessors interconnected with both Fibre Channel and Ethernet. We find that multiplexing between Ethernet and Fibre Channel can substantially reduce communication overhead in a synthetic benchmark compared to using either network alone. Aggregating these two networks into a single virtual network can further reduce communication delays for applications with many large messages. The best choice of either multiplexing or aggregation depends on the mix of message sizes in the application program and on the relative overheads of the two networks.
{"title":"Exploiting multiple heterogeneous networks to reduce communication costs in parallel programs","authors":"JunSeong Kim, D. Lilja","doi":"10.1109/HCW.1997.581412","DOIUrl":"https://doi.org/10.1109/HCW.1997.581412","url":null,"abstract":"The different types of messages used by a parallel application program executing in a distributed system can each have unique characteristics so that no single communication network can produce the lowest latency for all messages. For instance, short control messages may be sent with the lowest overhead on one type of network, such as Ethernet, while bulk data transfers may be better suited to a different type of network, such as Fibre Channel or HiPPI. In this paper, we investigate how to exploit multiple heterogeneous communication networks that interconnect the same set of processing nodes by dynamically selecting the best (lowest latency) network for each message based on the message size. We also show how to aggregate these multiple parallel networks into a single virtual network to further reduce the latency and increase the available bandwidth. We test this multiplexing and aggregation on a cluster of SGI multiprocessors interconnected with both Fibre Channel and Ethernet. We find that multiplexing between Ethernet and Fibre Channel can substantially reduce communication overhead in a synthetic benchmark compared to using either network alone. Aggregating these two networks into a single virtual network can further reduce communication delays for applications with many large messages. The best choice of either multiplexing or aggregation depends on the mix of message sizes in application program and the relative overheads of the two networks.","PeriodicalId":286909,"journal":{"name":"Proceedings Sixth Heterogeneous Computing Workshop (HCW'97)","volume":"70 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129639766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous computing covers a great variety of situations. This study focuses on a particular application domain (iterative automatic target recognition tasks) and an associated specific class of dedicated heterogeneous hardware platforms. The contribution of this paper is that, for the computational environment considered, it presents a methodology for real-time, on-line, input-data-dependent remapping of the application subtasks to the processors in the heterogeneous hardware platform, using statically determined mappings derived and stored off-line. That is, the operating system will be able to decide during the execution of the application whether or not to perform a remapping, based on information generated by the application from its input data. If the decision is to remap, the operating system will be able to select a previously derived and stored mapping that is appropriate for the given state of the application (e.g., the number of objects it is currently tracking).
{"title":"On-line use of off-line derived mappings for iterative automatic target recognition tasks and a particular class of hardware platforms","authors":"J. Budenske, R. Ramanujan, H. Siegel","doi":"10.1109/HCW.1997.581413","DOIUrl":"https://doi.org/10.1109/HCW.1997.581413","url":null,"abstract":"Heterogeneous computing covers a great variety of situations. This study focuses on a particular application domain (iterative automatic target recognition tasks) and an associated specific class of dedicated heterogeneous hardware platforms. The contribution of this paper is that, for the computational environment considered, it presents a methodology for real-time on-line input-data dependent remappings of the application subtasks to the processors in the heterogeneous hardware platform using previously stored off-line statically determined mappings. That is, the operating system will be able to decide during the execution of the application whether or not to perform a remapping based on information generated by the application from its input data. If the decision is to remap, the operating system will be able to select a previously derived and stored mapping that is appropriate for the given state of the application (e.g., the number of objects it is currently tracking).","PeriodicalId":286909,"journal":{"name":"Proceedings Sixth Heterogeneous Computing Workshop (HCW'97)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126545938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed systems have the potential to become an alternative platform for parallel computations. However, many obstacles remain; one of the most serious is that distributed systems typically consist of shared heterogeneous components with highly variable computational power. We present a load-balancing support layer that checks the load status and, if necessary, adapts the workload to dynamic platform conditions through data migrations from overloaded to underloaded nodes. Unlike task-migration supports for task parallelism and other data-migration frameworks for master/slave-based parallel applications, our support works for the entire class of regular SPMD applications with explicit communications, such as linear algebra problems, partial differential equation solvers, and image processing algorithms. Although we considered several variants (three activation mechanisms, three load-monitoring techniques, and four decision policies), we implemented only the protocols that guarantee program consistency. The efficiency of the strategies is tested on two SPMD algorithms built on the PVM library, enriched with special-purpose primitives for data management. As an additional contribution, our research keeps the entire support for dynamic load balancing transparent to the programmer; the only visible interface is the activation phase.
{"title":"Dynamic load balancing of distributed SPMD computations with explicit message-passing","authors":"M. Cermele, M. Colajanni, G. Necci","doi":"10.1109/HCW.1997.581406","DOIUrl":"https://doi.org/10.1109/HCW.1997.581406","url":null,"abstract":"Distributed systems have the potentiality of becoming an alternative platform for parallel computations. However, there are still many obstacles to overcome, one of the most serious is that distributed systems typically consist of shared heterogeneous components with highly variable computational power. We present a load balancing support that checks the load status and, if necessary, adapts the workload to dynamic platform conditions through data migrations from overloaded to underloaded nodes. Unlike task migration supports for task parallelism and other data migration frameworks for master/slave-based parallel applications, our support works for the entire class of SPMD regular applications with explicit communications such as linear algebra problems, partial differential equation solvers, image processing algorithms. Although we considered several variants (three activation mechanisms, three load monitoring techniques and four decision policies), we implemented only the protocols that guarantee program consistency. The efficiency of the strategies is tested in the instance of two SPMD algorithms that are based on the PVM library enriched by special-purpose primitives for data management. As additional contribution, our research keeps the entire support for dynamic load balancing transparent to the programmer. The only visible interface of our support is the activation phase.","PeriodicalId":286909,"journal":{"name":"Proceedings Sixth Heterogeneous Computing Workshop (HCW'97)","volume":"1212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116484299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A case study was conducted to examine the performance and portability of parallel applications, with an emphasis on data transfer among processors in heterogeneous environments. Several parallel test programs using MPICH, a message passing interface (MPI) library, and the Linda parallel environment were developed to analyze communication performance and portability. These programs implement loosely and tightly synchronized communication models in which each processor exchanges data with two other processors. This data-exchange pattern mimics communication in certain parallel applications that use striped partitioning of the computational domain. Tests were performed on an isolated distributed-computing testbed, a live development network, and a symmetric multiprocessing computer system. All network configurations used asynchronous transfer mode (ATM) network technology. The testbed used in the study was a heterogeneous network consisting of various workstations and networking equipment. This paper presents an analysis of the results and recommendations for designing and implementing coarse-grained parallel scientific applications.
{"title":"A performance and portability study of parallel applications using a distributed computing testbed","authors":"V. Morariu, Mathew Cunningham, Mark Letterman","doi":"10.1109/HCW.1997.581423","DOIUrl":"https://doi.org/10.1109/HCW.1997.581423","url":null,"abstract":"A case study was conducted to examine the performance and portability of parallel applications, with an emphasis on data transfer among the processors in heterogeneous environments. Several parallel test programs using MPICH, a message passing interface (MPI) library, and the Linda parallel environment were developed to analyze communication performance and portability. These programs implement loosely and tightly synchronized communication models in which each processor exchanges data with two other processors. This data-exchange pattern mimics communication in certain parallel applications using striped partitioning of the computational domain. Tests were performed on an isolated, distributed computing testbed, a live development network and a symmetrical multiprocessing computer system. All network configurations used asynchronous transfer mode (ATM) network technologies. The testbed used in the study was a heterogeneous network consisting of various workstations and networking equipment. This paper presents an analysis of the results and recommendations for designing and implementing course-grained, parallel, scientific applications.","PeriodicalId":286909,"journal":{"name":"Proceedings Sixth Heterogeneous Computing Workshop (HCW'97)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115361460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous parallel processing systems have been used extensively in embedded military applications because of their advantages in size, weight, power, and hardware cost. This paper reviews the evolution of some of these systems and discusses design factors and tradeoffs that affect their application. As military systems have become more cost-sensitive, and initial development more common than long-term production, the use of commercial hardware and software has become more widespread. The rapid advance of computer technology seems likely to accelerate that trend.
{"title":"Practical issues in heterogeneous processing systems for military applications","authors":"G. Ladd","doi":"10.1109/HCW.1997.581418","DOIUrl":"https://doi.org/10.1109/HCW.1997.581418","url":null,"abstract":"Heterogeneous parallel processing systems have been extensively used in embedded military applications due to their advantages in size, weight, power and hardware cost. This paper reviews the evolution of some of these systems and discusses design factors and tradeoffs which affect their application. As military systems have become more cost sensitive, and initial development more common than long term production, the use of commercial hardware and software has become more common. The rapid advances of computer technology seem likely to accelerate that trend in the future.","PeriodicalId":286909,"journal":{"name":"Proceedings Sixth Heterogeneous Computing Workshop (HCW'97)","volume":"221 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115654604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed systems comprising networked heterogeneous workstations are now considered a viable choice for high-performance computing. To achieve fast response times from such systems, an efficient assignment of application tasks to processors is imperative. The general assignment problem is known to be NP-hard, except in a few special cases with strict assumptions. While a large number of heuristic techniques suggested in the literature can yield sub-optimal solutions in a reasonable amount of time, we aim to develop techniques for optimal solutions under relaxed assumptions. The basis of our research is a best-first search technique known as the A* algorithm, from the area of artificial intelligence. The original search technique guarantees an optimal solution but is not feasible for problems of practically large size because of its high time and space complexity. We propose a number of algorithms based on the A* technique. The proposed algorithms also yield optimal solutions but are considerably faster. The first algorithm solves the assignment problem using parallel processing. Parallelizing the assignment algorithm is a natural way to lower the time complexity, and we believe our algorithm is novel in this regard. The second algorithm is based on a clustering-based pre-processing technique that merges high-affinity tasks. Clustering reduces the problem size, which in turn reduces the state space for the assignment algorithm. We also propose three heuristics that do not guarantee optimal solutions but provide near-optimal solutions considerably faster. Using our parallel formulation, the proposed clustering technique and the heuristics can also be parallelized to further improve their time complexity.
{"title":"Optimal task assignment in heterogeneous computing systems","authors":"Muhammad Kafil, I. Ahmad","doi":"10.1109/HCW.1997.581416","DOIUrl":"https://doi.org/10.1109/HCW.1997.581416","url":null,"abstract":"Distributed systems comprising networked heterogeneous workstations are now considered to be a viable choice for high-performance computing. For achieving a fast response time from such systems, an efficient assignment of the application tasks to the processors is imperative. The general assignment problem is known to be NP-hard, except in a few special cases with strict assumptions. While a large number of heuristic techniques have been suggested in the literature that can yield sub-optimal solutions in a reasonable amount of time, we aim to develop techniques for optimal solutions under relaxed assumptions. The basis of our research is a best-first search technique known as the A* algorithm from the area of artificial intelligence. The original search technique guarantees an optimal solution but is not feasible for problems of practically large sizes due to its high time and space complexity. We propose a number of algorithms based around the A* technique. The proposed algorithms also yield optimal solutions but are considerably faster. The first algorithm solves the assignment problem by using parallel processing. Parallelizing the assignment algorithm is a natural way to lower the time complexity, and we believe our algorithm to be novel in this regard. The second algorithm is based on a clustering based pre-processing technique that merges the high affinity tasks. Clustering reduces the problem size, which in turn reduces the state-space for the assignment algorithm. We also propose three heuristics which do not guarantee optimal solutions but provide near-optimal solutions and are considerably faster. By using our parallel formulation, the proposed clustering technique and the heuristics can also be parallelized to further improve their time complexity.","PeriodicalId":286909,"journal":{"name":"Proceedings Sixth Heterogeneous Computing Workshop (HCW'97)","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131995755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}