Pub Date : 1995-06-27 DOI: 10.1109/FTCS.1995.466999
Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pi-Yu Chung, C. Kintala
The paper describes our experience with the implementation and applications of the Unix checkpointing library libckp, and identifies two concepts that have proven key to making checkpointing a powerful tool. First, including all persistent state, i.e., user files, as part of the process state that can be checkpointed and recovered provides a truly transparent and consistent rollback. Second, excluding part of the persistent state from the process state allows user programs to process future inputs from a desirable state, which leads to interesting new applications of checkpointing. We use real-life examples to demonstrate the use of libckp for bypassing premature software exits, for fast initialization, and for memory rejuvenation.
{"title":"Checkpointing and its applications","doi":"10.1109/FTCS.1995.466999","journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers"}}
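The first concept in the abstract above, treating user files as part of the checkpointed state, can be illustrated with a minimal sketch. This is not the libckp API: the `Checkpoint` class, its `paths` argument, and the `demo` function are hypothetical names chosen for illustration, and the sketch copies whole files rather than tracking changes.

```python
import copy
import os
import tempfile


class Checkpoint:
    """Snapshot of in-memory state plus the contents of tracked files (hypothetical helper)."""

    def __init__(self, state, paths):
        self.state = copy.deepcopy(state)
        self.files = {}
        for path in paths:
            with open(path, "rb") as fh:
                self.files[path] = fh.read()

    def restore(self):
        """Write the saved file contents back and return a copy of the saved state."""
        for path, data in self.files.items():
            with open(path, "wb") as fh:
                fh.write(data)
        return copy.deepcopy(self.state)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "output.log")
        with open(log, "w") as fh:
            fh.write("record 1\n")
        state = {"processed": 1}
        ckpt = Checkpoint(state, [log])

        # Work done after the checkpoint changes both memory and the file.
        state["processed"] = 2
        with open(log, "a") as fh:
            fh.write("record 2\n")

        # Rollback restores both, so memory and file agree again.
        state = ckpt.restore()
        with open(log) as fh:
            return state, fh.read()
```

Leaving a file out of `paths` would correspond loosely to the second concept: that file keeps its current contents across a rollback, so the restored process continues from the old in-memory state against newer persistent data.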
Pub Date : 1995-06-27 DOI: 10.1109/FTCS.1995.466964
J. Plank, Youngbae Kim, J. Dongarra
The paper explores diskless checkpointing for distributed scientific computations. With the widespread use of the "network of workstations" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault tolerance, and present performance results on a PVM network of SUN workstations and on the IBM SP2.
{"title":"Algorithm-based diskless checkpointing for fault tolerant matrix operations","doi":"10.1109/FTCS.1995.466964","journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers"}}
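Diskless checkpointing, as described in the abstract above, keeps redundancy in the memory of other processors instead of on disk. The abstract does not say which encoding the authors use, so the following is a minimal sketch assuming a simple bytewise XOR parity held by one extra processor; `make_parity` and `recover` are hypothetical names.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(states):
    """Parity block that an extra processor would hold (assumed encoding)."""
    return xor_blocks(states)


def recover(states, parity, lost):
    """Rebuild the state of processor `lost` from the survivors and the parity block."""
    survivors = [s for i, s in enumerate(states) if i != lost]
    return xor_blocks(survivors + [parity])
```

For example, with `states = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]` and `parity = make_parity(states)`, the call `recover(states, parity, 1)` rebuilds the second processor's block from the other two and the parity. This scheme tolerates the loss of any single block between checkpoints.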
Pub Date : 1995-06-27 DOI: 10.1109/FTCS.1995.466981
H. Yotsuyanagi, S. Kajihara, K. Kinoshita
The existence of sequential redundancy degrades the testability of sequential circuits. By using retiming, which rearranges flip-flops, some sequential redundancy is converted into combinational redundancy, which can be easily identified and removed by a combinational test generation technique. Retiming is used for two purposes: one is to find sequential redundancy and the other is to reduce the number of flip-flops. Applying retiming and redundancy removal concurrently enhances the testability of sequential circuits. Experimental results for ISCAS'89 benchmark circuits show the effectiveness of this method for optimizing circuits.
{"title":"Synthesis for testability by sequential redundancy removal using retiming","doi":"10.1109/FTCS.1995.466981","journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers"}}
Pub Date : 1995-06-27 DOI: 10.1109/FTCS.1995.466977
P. Dahlgren, P. Lidén
A two-step switch-level algorithm for fault simulation of transients in CMOS networks is presented. The first step models the fault propagation locally from the fault injection site to the subsequent CMOS blocks. It is shown that the pulse width of a transient is a vital parameter in the propagation process. A first-order RC network model for the prediction of the width of transients is used. The second step consists of a set of rules for the propagation of fully developed transients through basic CMOS blocks. The fact that transients may fade out during propagation is efficiently modeled by taking into account their pulse widths. The proposed algorithm shows good agreement with electrical-level simulations in predicting the effects of device-level transients.
{"title":"A switch-level algorithm for simulation of transients in combinational logic","doi":"10.1109/FTCS.1995.466977","journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers"}}
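The abstract above notes that pulse width decides whether a transient survives propagation. A rough way to see why, assuming a rectangular input pulse driving a single first-order RC stage and an assumed switching threshold of half the supply voltage (neither detail is given in the abstract, and the function names are hypothetical), is the sketch below.

```python
import math


def peak_voltage(pulse_width, r, c, vdd=1.0):
    """Peak output of a first-order RC stage driven by a rectangular pulse of height vdd."""
    return vdd * (1.0 - math.exp(-pulse_width / (r * c)))


def propagates(pulse_width, r, c, vdd=1.0, threshold=0.5):
    """True if the peak reaches the assumed switching threshold of the next block."""
    return peak_voltage(pulse_width, r, c, vdd) >= threshold * vdd


def min_width(r, c, threshold=0.5):
    """Shortest pulse that still reaches the threshold: -RC * ln(1 - threshold)."""
    return -r * c * math.log(1.0 - threshold)
```

With R = 1 kΩ and C = 1 pF the time constant is 1 ns, so under these assumptions a pulse narrower than about 0.69 ns (RC · ln 2) never reaches the threshold and fades out, while a 2 ns pulse passes.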
Pub Date : 1995-06-27 DOI: 10.1109/FTCS.1995.466966
C. Feng, Wei-Kang Huang, F. Lombardi
Existing one-step diagnosis approaches for faults in interconnects either yield a long test sequence, or use a non-generalized procedure to generate a shorter test sequence. We propose a new diagnosis approach for short faults in interconnects. The pin-adjacency fault model is assumed. By using a divide-and-conquer strategy, our approach can generate a very compact test vector sequence which can diagnose an unrestricted number of short faults. Our experiments on three benchmarks as well as large random interconnects (up to 50,000 nets) show that our approach can achieve more than 50% savings in the length of the generated test sequence. This can significantly reduce the diagnosis cost for boundary-scan testing. An adaptive diagnosis approach is further proposed to dynamically truncate the originally generated test sequence based on the current information about faulty nets. The performance of our adaptive approach, in terms of the on-line test generation time and the resulting test sequence length, is better than that of existing adaptive diagnosis approaches when the fault rate is not very small, such as in a new product line. If a low complexity for the ATE is of major importance, then the proposed one-step approach is the best choice.
{"title":"A new diagnosis approach for short faults in interconnects","doi":"10.1109/FTCS.1995.466966","journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers"}}
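To make the idea of a test vector sequence for interconnect shorts concrete, here is a sketch of the classic counting-sequence baseline, not the divide-and-conquer procedure proposed in the paper. It assumes a wired-AND short model and ignores the pin-adjacency restriction; all function names are hypothetical.

```python
def counting_sequence(num_nets):
    """One test vector per bit of the net index: vector k drives bit k of each index."""
    width = max(1, (num_nets - 1).bit_length())
    return [[(net >> k) & 1 for net in range(num_nets)] for k in range(width)]


def apply_shorts(vector, shorts):
    """Simulate wired-AND shorts: every net in a shorted group sees the AND of the group."""
    observed = list(vector)
    for group in shorts:
        value = min(vector[n] for n in group)
        for n in group:
            observed[n] = value
    return observed


def suspect_nets(num_nets, shorts):
    """Nets whose observed value differs from the driven value in at least one vector."""
    suspects = set()
    for vector in counting_sequence(num_nets):
        observed = apply_shorts(vector, shorts)
        suspects.update(n for n in range(num_nets) if observed[n] != vector[n])
    return sorted(suspects)
```

For 8 nets the sequence has 3 vectors, and a short between nets 2 and 5 flags both. The baseline has blind spots: in this sketch net 0 is driven with all zeros, so a short involving it shows up only on the other net.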
Pub Date : 1995-06-27 DOI: 10.1109/FTCS.1995.466998
L. Moser, P. Melliar-Smith, D. Agarwal, R. K. Budhia, C. Lingley-Papadopoulos, T. P. Archambault
The Totem system supports fault-tolerant applications in which distributed processes cooperate to perform a common task and in which replicated data must be updated consistently in the presence of asynchrony and faults. Reliable totally ordered delivery of messages to processes within process groups is provided on a single local-area network or over multiple local-area networks interconnected by gateways. Message ordering is consistent across the entire network, despite processor and communication faults, without requiring all processes to deliver all messages. The Totem system handles processor failure and recovery, as well as network partitioning and remerging, and provides membership and topology maintenance services.
{"title":"The Totem system","doi":"10.1109/FTCS.1995.466998","journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers"}}
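Totally ordered delivery, as described in the abstract above, means every process delivers messages in the same order even if they arrive in different orders. The abstract does not describe the mechanism, so the sketch below assumes a simple scheme in which a circulating token carries the next global sequence number and receivers buffer out-of-order arrivals. The `Token` and `Receiver` classes are hypothetical, and the sketch omits fault handling, membership, and retransmission.

```python
class Token:
    """Circulating token carrying the next global sequence number (assumed scheme)."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, sender, payload):
        """Called by the current token holder to assign a sequence number to a message."""
        msg = (self.next_seq, sender, payload)
        self.next_seq += 1
        return msg


class Receiver:
    """Buffers messages and delivers them strictly in sequence-number order."""

    def __init__(self):
        self.expected = 0
        self.pending = {}
        self.delivered = []

    def receive(self, msg):
        self.pending[msg[0]] = msg
        # Deliver every message that is now contiguous with what was delivered before.
        while self.expected in self.pending:
            _, sender, payload = self.pending.pop(self.expected)
            self.delivered.append((sender, payload))
            self.expected += 1
```

Two receivers that see the same stamped messages in different arrival orders end up with identical `delivered` lists, which is the property replicated data updates rely on.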
Pub Date : 1995-06-01 DOI: 10.1109/FTCS.1995.466949
J. Fabre, V. Nicomette, T. Pérennou, R. Stroud, Zhixue Wu
The paper shows how reflection and object-oriented programming can be used to ease the implementation of classical fault tolerance mechanisms in distributed applications. When the underlying runtime system does not provide fault tolerance transparently, classical approaches to implementing fault tolerance mechanisms often imply mixing functional programming with non-functional programming (e.g. error processing mechanisms). The use of reflection improves the transparency of fault tolerance mechanisms to the programmer and more generally provides a clearer separation between functional and non-functional programming. The implementations of some classical replication techniques using a reflective approach are presented in detail and illustrated by several examples, which have been prototyped on a network of Unix workstations. Lessons learnt from our experiments are drawn and future work is discussed.
{"title":"Implementing fault tolerant applications using reflective object-oriented programming","doi":"10.1109/FTCS.1995.466949","journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers"}}
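The separation between functional code and fault tolerance code described in the abstract above can be illustrated with a small interception sketch. The abstract does not name the language or metaobject protocol used, so Python's `__getattr__` hook stands in for it here. `ReplicatedProxy`, `Counter`, and `CrashedCounter` are hypothetical classes, and the replication policy (forward every call to all replicas, return the first successful result) is an assumed simplification of active replication.

```python
class ReplicatedProxy:
    """Intercepts method calls and forwards them to every replica."""

    def __init__(self, replicas):
        self._replicas = list(replicas)

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself, i.e. application methods.
        def invoke(*args, **kwargs):
            results = []
            for replica in self._replicas:
                try:
                    results.append(getattr(replica, name)(*args, **kwargs))
                except Exception:
                    continue  # a failed replica is skipped; the others still apply the call
            if not results:
                raise RuntimeError("all replicas failed")
            return results[0]
        return invoke


class Counter:
    """Functional code: contains no fault tolerance logic."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


class CrashedCounter(Counter):
    """Stand-in for an unreachable replica."""

    def increment(self):
        raise ConnectionError("replica unreachable")
```

A caller uses `ReplicatedProxy([CrashedCounter(), Counter(), Counter()]).increment()` exactly as it would use a plain `Counter`; the replication policy lives entirely in the proxy.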