Analytic reliability evaluation for fault-tolerant circuit structures on FPGAs
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962108
J. Anwer, M. Platzner
With the increasing error-proneness of nano-circuits, a number of fault-tolerance approaches have been presented in the literature to enhance circuit reliability. Evaluating the effectiveness of fault-tolerant circuit structures remains a challenge. An analytical model is required to provide exact reliability figures for a circuit design, to locate error-sensitive parts of the circuit, and to compare different fault-tolerance approaches. At the logic layer, probabilistic approaches exist that provide such measures of circuit reliability, but they do not consider the reliability-enhancement effect of fault-tolerant circuit structures. Furthermore, these approaches are often not applicable to large circuits, and their complexity hinders the development of generic simulation tools. In this paper we combine the Boolean difference error calculator (BDEC), a previous probabilistic reliability model for hardware designs, with a reliability model for fault-tolerant circuit structures. As a result, we are able to study the reliability of fault-tolerant circuit structures at the logic layer. We focus on fault-tolerant circuits to be implemented in FPGAs and show how to extend our combined model from combinational to sequential circuits. For analyzing larger circuits, we develop a MATLAB-based tool utilizing BDEC. With this tool, we perform a variability analysis of different input parameters, such as logic-component, input, and voter error probabilities, to observe their individual and joint effects on circuit reliability. Our analyses show that circuit reliability depends strongly on the circuit structure due to the error-masking effects of components on each other. Moreover, the benefit of redundancy is obtained only up to a certain threshold of component, input, and voter error probabilities, with voter reliability having the strongest impact on overall circuit reliability.
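For intuition on the redundancy threshold mentioned above, the textbook triple modular redundancy (TMR) relation with an imperfect voter (the standard formula such logic-layer models build on, not the authors' BDEC-based derivation) is:

$$R_{\mathrm{TMR}} = R_v \left(3R_m^{2} - 2R_m^{3}\right),$$

where $R_m$ is the reliability of one module copy and $R_v$ that of the voter. With a perfect voter ($R_v = 1$), $R_{\mathrm{TMR}} > R_m$ only when $R_m > 0.5$, which illustrates why redundancy pays off only below a certain component error probability and why voter reliability, as a common multiplying factor, has the strongest impact on the overall result.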
{"title":"Analytic reliability evaluation for fault-tolerant circuit structures on FPGAs","authors":"J. Anwer, M. Platzner","doi":"10.1109/DFT.2014.6962108","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962108","url":null,"abstract":"With increasing error-proneness of nano-circuits, a number of fault-tolerance approaches are presented in the literature to enhance circuit reliability. The evaluation of the effectiveness of fault-tolerant circuit structures remains a challenge. An analytical model is required to provide exact figures of reliability of a circuit design, to be able to locate error-sensitive parts of the circuit as well as to compare different fault-tolerance approaches. At the logic layer, probabilistic approaches exist that provide such measures of circuit reliability, but they do not consider the reliability-enhancement effect of fault-tolerant circuit structures. Furthermore, these approaches are often not applicable for large circuits and their complexity hinders the development of generic simulation tools. In this paper we combine the Boolean difference error calculator (BDEC), a previous probabilistic reliability model for hardware designs, with a reliability model for fault-tolerant circuit structures. As a result we are able to study the reliability of fault-tolerant circuit structures at the logic layer. We focus on fault-tolerant circuits to be implemented in FPGAs and show how to extend our combined model from combinational to sequential circuits. For analyzing larger circuits, we develop a MATLAB-based tool utilizing BDEC. With this tool, we perform a variability analysis of different input parameters, such as logic component, input and voter error probabilities, to observe their single and joint effect on the circuit reliability. Our analyses show that circuit reliability depends strongly on the circuit structure due to error-masking effects of components on each other. Moreover, the benefit of redundancy can be obtained up to a certain threshold of component, input and voter error probabilities though the voter reliability has the strongest impact on overall circuit reliability.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129394523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved correction for hot pixels in digital imagers
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962103
G. Chapman, Rohit Thomas, Rahul Thomas, I. Koren, Z. Koren
From an extensive study of digital imager defects, we found that “hot pixels” are the main digital camera defects and that they accumulate at a nearly constant temporal rate over the camera's lifetime. Previously, we characterized hot pixels by a linear function of exposure time measured under dark-frame conditions. Using a camera with 55 known hot pixels, we compared our hot-pixel correction algorithm to a conventional 4-nearest-neighbor interpolation technique. We developed a new “moving camera” method to obtain exactly both the actual hot-pixel contribution and the true undamaged pixel value at a defect. Using these calibrated results, we find that the correction method should be based on the hot-pixel severity, the illumination intensity at the pixel, camera parameters such as ISO and exposure time, and the neighboring pixels' variability.
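As a rough illustration of the correction strategy this abstract suggests, the sketch below blends dark-offset subtraction (the linear exposure-time model) with 4-nearest-neighbor interpolation, weighting by neighbor variability. The function and parameter names (correct_hot_pixel, dark_offset, dark_rate, iso_gain) and the blending rule are assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

def correct_hot_pixel(img, y, x, dark_offset, dark_rate, t_exp, iso_gain=1.0):
    """Illustrative hot-pixel correction (not the authors' exact method).

    dark_offset, dark_rate: per-pixel dark-response calibration, assumed to follow
    the linear model I_dark = dark_offset + dark_rate * t_exp from the abstract.
    """
    # Estimate the hot-pixel contribution for this exposure and subtract it.
    hot_estimate = iso_gain * (dark_offset + dark_rate * t_exp)
    corrected = img[y, x] - hot_estimate

    # Conventional baseline: 4-nearest-neighbor interpolation.
    neighbors = np.array([img[y - 1, x], img[y + 1, x], img[y, x - 1], img[y, x + 1]])
    interpolated = neighbors.mean()

    # Blend the two estimates: weight interpolation more when neighbors agree
    # (low variability), and the dark-subtracted value more in busy regions.
    w = 1.0 / (1.0 + neighbors.std())
    return w * interpolated + (1.0 - w) * corrected
```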
{"title":"Improved correction for hot pixels in digital imagers","authors":"G. Chapman, Rohit Thomas, Rahul Thomas, I. Koren, Z. Koren","doi":"10.1109/DFT.2014.6962103","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962103","url":null,"abstract":"From extensive study of digital imager defects, we found that “Hot Pixels” are the main digital camera defects, and that they increase at a nearly constant temporal rate over the camera's lifetime. Previously we characterized the hot pixels by a linear function of the exposure time in response to a dark frame setting. Using a camera with 55 known hot pixels, we compared our hot pixel correction algorithm to a conventional 4-nearest neighbor interpolation techniques. We developed a new “moving camera” method to exactly obtain both the actual hot pixel contribution and the true undamaged pixel value at a defect. Using these calibrated results we find that the correction method should be based on the hot pixel severity, the illumination intensity at the pixel, camera parameters such as ISO and exposure time, and on the neighboring pixels' variability.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129548426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy-efficient concurrent testing approach for many-core systems in the dark silicon age
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962075
M. Haghbayan, A. Rahmani, P. Liljeberg, J. Plosila, H. Tenhunen
The dark silicon issue means that the fraction of a silicon chip able to switch at full frequency is dropping, and designers will soon face a growing underutilization inherent in future technology scaling. On the other hand, as transistor sizes shrink, susceptibility to internal defects increases, and a wide range of defects, such as aging or transient faults, will show up more frequently. In this paper, we propose an online concurrent test scheduling approach for the fraction of the chip that cannot be utilized due to the utilization wall. Dynamic voltage and frequency scaling, including near-threshold operation, is utilized to maximize the concurrency of the online testing process under a constant power budget. As the dark area of the system is dynamic and reshapes at runtime, our approach dynamically tests unused cores at runtime to provide tested cores for upcoming applications and hence enhances system reliability. Empirical results show that our proposed concurrent testing approach using dynamic voltage and frequency scaling (DVFS) improves the overall test throughput by over 250% compared to state-of-the-art dark-silicon-aware online testing approaches under the same power budget.
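A minimal sketch of the scheduling idea, assuming a simple dynamic power model (P proportional to C·V²·f) and made-up voltage/frequency levels; it is not the authors' scheduler, only an illustration of why near-threshold operation lets more dark cores be tested concurrently under a fixed power budget.

```python
# Illustrative DVFS-aware concurrent test scheduling under a power budget.
# V/f pairs and the power model are assumptions, not the paper's calibrated values.
VF_LEVELS = [(0.5, 200e6), (0.7, 500e6), (0.9, 1000e6)]  # (volts, Hz), incl. near-threshold

def schedule_tests(dark_cores, power_budget, c_eff=1e-9, test_cycles=1e6):
    """Greedily assign the lowest-power V/f level to as many dark cores as possible."""
    plan, used_power = [], 0.0
    v, f = VF_LEVELS[0]                       # near-threshold level maximizes concurrency
    core_power = c_eff * v * v * f            # dynamic power of one core under test
    for core in dark_cores:
        if used_power + core_power > power_budget:
            break
        plan.append((core, v, f, test_cycles / f))  # core, voltage, freq, test time [s]
        used_power += core_power
    return plan

print(schedule_tests(dark_cores=list(range(16)), power_budget=1.0))
```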
{"title":"Energy-efficient concurrent testing approach for many-core systems in the dark silicon age","authors":"M. Haghbayan, A. Rahmani, P. Liljeberg, J. Plosila, H. Tenhunen","doi":"10.1109/DFT.2014.6962075","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962075","url":null,"abstract":"Dark Silicon issue stresses that a fraction of silicon chip being able to switch in a full frequency is dropping and designers will soon face a growing underutilization inherent in future technology scaling. On the other hand, by reducing the transistor sizes, susceptibility to internal defects increases and large range of defects such as aging or transient faults will be shown up more frequently. In this paper, we propose an online concurrent test scheduling approach for the fraction of chip that cannot be utilized due to the restricted utilization wall. Dynamic voltage and frequency scaling including near-threshold operation is utilized in order to maximize the concurrency of the online testing process under the constant power. As the dark area of the system is dynamic and reshapes at a runtime, our approach dynamically tests unused cores in a runtime to provided tested cores for upcoming application and hence enhance system reliability. Empirical results show that our proposed concurrent testing approach using dynamic voltage and frequency scaling (DVFS) improves the overall test throughput by over 250% compared to the state-of-the-art dark silicon aware online testing approaches under the same power budget.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121583805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A heuristic path selection method for small delay defects test
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962082
Paniz Foroutan, M. Kamal, Z. Navabi
With the increasing impact of process variation on the uncertainty of gate delays, and the resulting need for a larger number of test paths, delay testing has become an essential part of chip testing. In this paper, a heuristic test path selection method is proposed that combines non-optimal and optimal selection methods. In the first step of the proposed method, the search space is reduced by considering correlations between paths. Next, using an ILP formulation, the best paths are selected from the reduced search space. For the ILP formulation, we propose an objective function that considers both the correlation and the criticality of the paths. The results show that the delay failure capturing probability (DFCP) of the proposed path selection method for the eight largest ITC'99 benchmarks is, on average, only about 3% lower than that of the Monte Carlo method, while its runtime is about 1340 times shorter than the Monte Carlo approach.
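A minimal sketch of an ILP-style path selection in the spirit described above, using the open-source PuLP solver; the path set, criticality weights, correlation pairs, penalty, and budget K are invented illustrative values, not the paper's formulation or data.

```python
# Select test paths that are critical but weakly correlated, as a small ILP.
import pulp

criticality = {"p1": 0.9, "p2": 0.8, "p3": 0.75, "p4": 0.6}   # per-path criticality
correlation = {("p1", "p2"): 0.7, ("p3", "p4"): 0.5}          # pairwise overlap
K, PENALTY = 2, 1.0                                            # paths to pick, penalty weight

prob = pulp.LpProblem("path_selection", pulp.LpMaximize)
x = {p: pulp.LpVariable(f"x_{p}", cat="Binary") for p in criticality}
y = {pair: pulp.LpVariable(f"y_{pair[0]}_{pair[1]}", cat="Binary") for pair in correlation}

# y[i,j] must be 1 whenever both paths of a correlated pair are selected (linearization).
for (i, j), var in y.items():
    prob += var >= x[i] + x[j] - 1

prob += pulp.lpSum(x.values()) <= K
prob += (pulp.lpSum(criticality[p] * x[p] for p in x)
         - PENALTY * pulp.lpSum(correlation[pair] * y[pair] for pair in y))
prob.solve()
print([p for p in x if x[p].value() > 0.5])   # e.g. picks two critical, uncorrelated paths
```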
{"title":"A heuristic path selection method for small delay defects test","authors":"Paniz Foroutan, M. Kamal, Z. Navabi","doi":"10.1109/DFT.2014.6962082","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962082","url":null,"abstract":"By increasing the impact of process variation on the uncertainty of the delay of the gates, and also the need for increasing the number of test paths, delay test has become an essential part of the chip testing. In this paper, a heuristic test path selection method is proposed that is a combination of the non-optimal and optimal selection methods. In the first step of the proposed selection method, the search space is reduced by considering correlations between the paths. Next, by using ILP formulation, best paths from the reduced search space are selected. For the ILP formulation, we have proposed an objective function which considers correlation and the criticality of the paths. The results show that the delay failure capturing probability (DFCP) of the proposed path selection method for eight largest ITC'99 benchmarks, on average, is only about 3% smaller than the Monte Carlo method, while its runtime is about 1340 times smaller than the Monte Carlo approach.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129401641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aging analysis for recycled FPGA detection
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962099
H. Dogan, Domenic Forte, M. Tehranipoor
The counterfeit electronic component industry continues to threaten the security and reliability of systems by infiltrating recycled components into the supply chain. With the increased use of FPGAs in critical systems, recycled FPGAs cause significant concerns for government and industry. In this paper, we propose a two-phase detection approach to differentiate recycled (used) FPGAs from new ones. Both phases rely on machine learning via support vector machines (SVMs) for classification. The first phase examines suspect FPGAs “as is”, while the second phase requires some accelerated aging. More specifically, Phase I detects recycled FPGAs by comparing the frequencies of ring oscillators (ROs) distributed on the FPGAs against a golden model. Experimental results on Xilinx FPGAs show that Phase I can correctly classify 8 out of 20 FPGAs under test. However, Phase I fails to detect FPGAs at fast corners and with less prior usage. Phase II is then used to complement Phase I and overcome its limitations. The second phase performs a short aging step on the suspect FPGAs and exploits the aging-speed reduction (due to prior usage) to cover the cases missed by Phase I. In our silicon results, Phase II detects all the fresh and recycled FPGAs correctly.
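A minimal sketch of Phase I-style classification, training an SVM on ring-oscillator frequency signatures; the frequency data here is synthetic and the feature/label setup is an assumption, whereas the paper uses measurements from real Xilinx FPGAs compared against a golden model.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_ros = 32
fresh = rng.normal(loc=200.0, scale=2.0, size=(40, n_ros))      # MHz, new devices
recycled = rng.normal(loc=196.0, scale=2.5, size=(40, n_ros))   # aged ROs oscillate slower

X = np.vstack([fresh, recycled])
y = np.array([0] * len(fresh) + [1] * len(recycled))             # 0 = new, 1 = recycled

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)          # train on RO signatures

suspect = rng.normal(loc=197.0, scale=2.5, size=(1, n_ros))      # device under test
print("recycled" if clf.predict(suspect)[0] == 1 else "new")
```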
{"title":"Aging analysis for recycled FPGA detection","authors":"H. Dogan, Domenic Forte, M. Tehranipoor","doi":"10.1109/DFT.2014.6962099","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962099","url":null,"abstract":"The counterfeit electronic component industry continues to threaten the security and reliability of systems by infiltrating recycled components into the supply chain. With the increased use of FPGAs in critical systems, recycled FPGAs cause significant concerns for government and industry. In this paper, we propose a two phase detection approach to differentiate recycled (used) FPGAs from new ones. Both approaches rely on machine learning via support vector machines (SVM) for classification. The first phase examines suspect FPGAs “as is” while the second phase requires some accelerated aging. To be more specific, Phase I detects recycled FPGAs by comparing the frequencies of ring oscillators (ROs) distributed on the FPGAs against a golden model. Experimental results on Xilinx FPGAs show that Phase I can correctly classify 8 out of 20 FPGAs under test. However, Phase I fails to detect FPGAs at fast corners and with lesser prior usage. Phase II is then used to compliment Phase I and overcome its limitations. The second phase performs a short aging step on the suspect FPGAs and exploits the aging speed reduction (due to prior usage) to cover the cases missed by Phase I. In our silicon results, Phase II detects all the fresh and recycled FPGAs correctly.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133387206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A probabilistic analysis of resilient reconfigurable designs
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962074
A. Malek, S. Tzilis, D. Khan, I. Sourdis, Georgios Smaragdos, C. Strydis
Reconfigurable hardware can be employed to tolerate permanent faults. The hardware components comprising a System-on-Chip can be partitioned into a handful of substitutable units interconnected with reconfigurable wires, allowing isolation and replacement of faulty parts. This paper offers a probabilistic analysis of reconfigurable designs, estimating, for different fault densities, the average number of fault-free components that can be constructed as well as the probability of guaranteeing a particular availability of components. Taking the area overheads of reconfigurability into account, we evaluate the resilience of various reconfigurable designs with different granularities. Based on this analysis, we conduct a comprehensive design-space exploration to identify the granularity mixes that maximize the fault tolerance of a system. Our findings reveal that mixing fine-grain logic with a coarse-grain sparing approach tolerates up to 3× more permanent faults than component redundancy and 2× more than any other purely coarse-grain solution. Component redundancy is preferable at low fault densities, while coarse-grain and mixed-grain reconfigurability maximize availability at medium and high fault densities, respectively.
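A back-of-the-envelope sketch of this kind of analysis, assuming a Poisson defect model (a unit of effective area A is fault-free with probability e^(-dA)) and an invented wiring-overhead factor for finer granularity; the numbers are illustrative, not the paper's model.

```python
import math

def unit_ok(d, area):
    """Probability that a unit of (effective) area is fault-free under a Poisson defect model."""
    return math.exp(-d * area)

def expected_good_units(d, n_units, unit_area, overhead=1.0):
    # overhead > 1 inflates effective area to account for reconfiguration wiring
    return n_units * unit_ok(d, unit_area * overhead)

def prob_at_least(k, n, p):
    """P[at least k of n units are fault-free] (binomial tail)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Coarse grain: 4 big units; fine grain: 16 small units with 20% extra wiring overhead.
d = 0.05                                         # faults per unit area (illustrative)
print(expected_good_units(d, 4, 4.0), expected_good_units(d, 16, 1.0, overhead=1.2))
print(prob_at_least(3, 4, unit_ok(d, 4.0)), prob_at_least(12, 16, unit_ok(d, 1.2)))
```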
{"title":"A probabilistic analysis of resilient reconfigurable designs","authors":"A. Malek, S. Tzilis, D. Khan, I. Sourdis, Georgios Smaragdos, C. Strydis","doi":"10.1109/DFT.2014.6962074","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962074","url":null,"abstract":"Reconfigurable hardware can be employed to tolerate permanent faults. Hardware components comprising a System-on-Chip can be partitioned into a handful of substitutable units interconnected with reconfigurable wires to allow isolation and replacement of faulty parts. This paper offers a probabilistic analysis of reconfigurable designs estimating for different fault densities the average number of fault-free components that can be constructed as well as the probability to guarantee a particular availability of components. Considering the area overheads of reconfigurability, we evaluate the resilience of various reconfigurable designs with different granularities. Based on this analysis, we conduct a comprehensive design-space exploration to identify the granularity mixes that maximize the fault-tolerance of a system. Our findings reveal that mixing fine-grain logic with a coarse-grain sparing approach tolerates up to 3× more permanent faults than component redundancy and 2× more than any other purely coarse-grain solution. Component redundancy is preferable at low fault densities, while coarse-grain and mixed-grain reconfigurability maximize availability at medium and high fault densities, respectively.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"172 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133280308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oxide based resistive RAM: ON/OFF resistance analysis versus circuit variability
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962107
H. Aziza, H. Ayari, S. Onkaraiah, J. Portal, M. Moreau, M. Bocquet
A deeper understanding of the impact of variability on Oxide-based Resistive Random Access Memory (so-called OxRRAM) is needed to propose variability-tolerant designs and ensure the robustness of the technology. Although research has taken steps to resolve this issue, variability remains a major concern for OxRRAMs. In this paper, the impact of variability on OxRRAM circuit performance is analysed quantitatively at the circuit level through electrical simulations. Variability is introduced both at the memory-cell level and at the peripheral-circuitry level. The aim of this study is to determine the contribution of each component of an OxRRAM circuit to the ON/OFF resistance ratio.
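A Monte Carlo sketch of how cell-level variability narrows the ON/OFF window; the lognormal spreads and nominal LRS/HRS values are illustrative assumptions, not the device parameters or electrical-simulation setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
r_on = rng.lognormal(mean=np.log(10e3), sigma=0.3, size=n)    # low-resistance state  [ohm]
r_off = rng.lognormal(mean=np.log(1e6), sigma=0.5, size=n)    # high-resistance state [ohm]

ratio = r_off / r_on                                          # per-cell ON/OFF ratio
print(f"median ON/OFF ratio: {np.median(ratio):.0f}")
print(f"fraction of cells with ratio < 10: {np.mean(ratio < 10):.2e}")
```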
{"title":"Oxide based resistive RAM: ON/OFF resistance analysis versus circuit variability","authors":"H. Aziza, H. Ayari, S. Onkaraiah, J. Portal, M. Moreau, M. Bocquet","doi":"10.1109/DFT.2014.6962107","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962107","url":null,"abstract":"A deeper understanding of the impact of variability on Oxide-based Resistive Random Access Memory (so-called OxRRAM) is needed to propose variability tolerant designs to ensure the robustness of the technology. Although research has taken steps to resolve this issue, variability remains an important characteristic for OxRRAMs. In this paper, impact of variability on OxRRAM circuit performances is analysed quantitatively at a circuit level through electrical simulations. Variability is introduced at the memory cell level but also at the peripheral circuitry level. The aim of this study is to determine the contribution of each component of an OxRRAM circuit on the ON/OFF resistance ratio.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128138391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Intel TSX for fault-tolerant execution in safety-critical systems
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962083
Florian Haas, Sebastian Weis, Stefan Metzlaff, T. Ungerer
Safety-critical systems demand increasing computational power, which calls for high-performance embedded systems. While commercial off-the-shelf (COTS) processors offer high computational performance at a low price, they do not provide hardware support for fault-tolerant execution. However, purely software-based fault-tolerance methods entail high design complexity and runtime overhead. In this paper, we present an efficient software/hardware-based redundant execution scheme for a COTS x86 processor, which exploits the Transactional Synchronization Extensions (TSX) introduced with the Intel Haswell microarchitecture. Our approach extends a static binary instrumentation tool to insert fault-tolerant transactions and fault-detection instructions at function granularity. TSX hardware support is used for error containment and recovery. The average runtime overhead for selected SPEC2006 benchmarks was only 49% compared to a non-fault-tolerant execution.
{"title":"Exploiting Intel TSX for fault-tolerant execution in safety-critical systems","authors":"Florian Haas, Sebastian Weis, Stefan Metzlaff, T. Ungerer","doi":"10.1109/DFT.2014.6962083","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962083","url":null,"abstract":"Safety-critical systems demand increasing computational power, which requests high-performance embedded systems. While commercial-of-the-shelf (COTS) processors offer high computational performance for a low price, they do not provide hardware support for fault-tolerant execution. However, pure software-based fault-tolerance methods entail high design complexity and runtime overhead. In this paper, we present an efficient software/hardware-based redundant execution scheme for a COTS ×86 processor, which exploits the Transactional Synchronization Extensions (TSX) introduced with the Intel Haswell microarchitecture. Our approach extends a static binary instrumentation tool to insert fault-tolerant transactions and fault-detection instructions at function granularity. TSX hardware support is used for error containment and recovery. The average runtime overhead for selected SPEC2006 benchmarks was only 49% compared to a non-fault-tolerant execution.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126911592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fault injection in the process descriptor of a Unix-based operating system
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962080
B. Montrucchio, M. Rebaudengo, A. Velasco
Transient faults, originating from several sources such as high-energy particles, are a major issue in computer-based systems for which high availability is a strict requirement. Fault injection is a commonly used method to evaluate the sensitivity of such systems. This paper presents an evaluation of the effects of faults in the memory containing the process descriptor of a Unix-based operating system. In particular, the state field has been taken as the main target, changing the current state value into another one that may be either valid or invalid. An experimental analysis has been conducted on a large set of different tasks belonging to the operating system itself. Test results show that the state field of the process descriptor is a critical variable as far as dependability is concerned.
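A minimal sketch of the fault model implied above: replacing the process-descriptor state value with either another valid state code or an invalid one and recording what was injected. The state codes follow common Linux task_struct conventions; the actual kernel-memory injection mechanism and outcome classification of the paper are not reproduced here.

```python
import random

VALID_STATES = {0: "TASK_RUNNING", 1: "TASK_INTERRUPTIBLE", 2: "TASK_UNINTERRUPTIBLE",
                4: "__TASK_STOPPED", 8: "__TASK_TRACED"}

def inject(current_state, rng):
    """Pick a replacement for the state field: valid-but-wrong or invalid."""
    if rng.random() < 0.5:                                    # valid but different state
        new_state = rng.choice([s for s in VALID_STATES if s != current_state])
    else:                                                     # invalid state code
        new_state = rng.choice([3, 5, 7, 0xFF])
    return new_state, new_state in VALID_STATES

rng = random.Random(42)
campaign = [inject(current_state=0, rng=rng) for _ in range(10)]
print(campaign)   # (injected value, is_valid) pairs to correlate with observed effects
```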
{"title":"Fault injection in the process descriptor of a Unix-based operating system","authors":"B. Montrucchio, M. Rebaudengo, A. Velasco","doi":"10.1109/DFT.2014.6962080","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962080","url":null,"abstract":"Transient faults in computer-based systems for which high availability is a strict requirement, originated from several sources, like high energy particles, are a major issue. Fault injection is a commonly used method to evaluate the sensitivity of such systems. The paper presents an evaluation of the effects of faults in the memory containing the process descriptor of a Unix-based Operating System. In particular the state field has been taken into consideration as the main target, changing the current state value into another one that could be valid or invalid. An experimental analysis has been conducted on a large set of different tasks, belonging to the operating system itself. Results of tests show that the state field in the process descriptor represents a critical variable as far as dependability is considered.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"302 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123082823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Artificial intelligence based task mapping and pipelined scheduling for checkpointing on real time systems with imperfect fault detection
Pub Date: 2014-11-24 | DOI: 10.1109/DFT.2014.6962066
Anup Das, Akash Kumar, B. Veeravalli
Fault tolerance is emerging as one of the important optimization objectives for designs in deep-submicron technology nodes. This paper proposes an application mapping and scheduling technique with checkpointing on a multiprocessor system to maximize reliability in the presence of transient faults. The proposed model incorporates checkpoints with imperfect fault-detection probability, as well as the pipelined execution and cyclic dependencies associated with multimedia applications. The problem is solved using an artificial intelligence technique known as Particle Swarm Optimization to determine the number of checkpoints for every task of the application that maximizes the confidence in the output. The proposed approach is validated experimentally with synthetic and real-life application graphs. Results demonstrate that the proposed technique improves the probability of a correct result by an average of 15% with imperfect fault detection. Additionally, even with 100% fault detection, the proposed technique achieves better results (25% higher confidence) than existing fault-tolerant techniques.
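A minimal Particle Swarm Optimization sketch for choosing per-task checkpoint counts; the fitness model (an undetected fault must slip past every checkpoint, subject to a budget on checkpoint overhead) and all numbers are assumed stand-ins for the paper's confidence model, not its formulation.

```python
import random

P_FAULT = [0.05, 0.10, 0.08]      # per-task fault probability (assumed)
DETECT = 0.9                      # per-checkpoint detection probability (imperfect)
OVERHEAD, BUDGET = 1.0, 6.0       # time cost per checkpoint, total slack available
MAX_CKPT = 5

def fitness(ckpts):
    if sum(ckpts) * OVERHEAD > BUDGET:
        return 0.0                                          # violates the timing slack
    conf = 1.0
    for p, c in zip(P_FAULT, ckpts):
        conf *= 1.0 - p * (1.0 - DETECT) ** c               # fault slips past all c checkpoints
    return conf

rng = random.Random(7)
n_tasks, n_particles = len(P_FAULT), 12
pos = [[rng.uniform(0, MAX_CKPT) for _ in range(n_tasks)] for _ in range(n_particles)]
vel = [[0.0] * n_tasks for _ in range(n_particles)]
pbest = [p[:] for p in pos]
gbest = max(pbest, key=lambda p: fitness([round(v) for v in p]))[:]

for _ in range(100):                                        # standard PSO velocity/position update
    for i in range(n_particles):
        for d in range(n_tasks):
            vel[i][d] = (0.7 * vel[i][d]
                         + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                         + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
            pos[i][d] = min(max(pos[i][d] + vel[i][d], 0), MAX_CKPT)
        if fitness([round(v) for v in pos[i]]) > fitness([round(v) for v in pbest[i]]):
            pbest[i] = pos[i][:]
        if fitness([round(v) for v in pbest[i]]) > fitness([round(v) for v in gbest]):
            gbest = pbest[i][:]

print("checkpoints per task:", [round(v) for v in gbest],
      "confidence:", fitness([round(v) for v in gbest]))
```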
{"title":"Artificial intelligence based task mapping and pipelined scheduling for checkpointing on real time systems with imperfect fault detection","authors":"Anup Das, Akash Kumar, B. Veeravalli","doi":"10.1109/DFT.2014.6962066","DOIUrl":"https://doi.org/10.1109/DFT.2014.6962066","url":null,"abstract":"Fault-tolerance is emerging as one of the important optimization objectives for designs in deep submicron technology nodes. This paper proposes a technique of application mapping and scheduling with checkpointing on a multiprocessor system to maximize the reliability considering transient faults. The proposed model incorporates checkpoints with imperfect fault detection probability, and pipelined execution and cyclic dependency associated with multimedia applications. This is solved using an Artificial Intelligence technique known as Particle Swarm Optimization to determine the number of checkpoints of every task of the application that maximizes the confidence of the output. The proposed approach is validated experimentally with synthetic and real-life application graphs. Results demonstrate the proposed technique improves the probability of correct result by an average 15% with imperfect fault detection. Additionally, even with 100% fault detection, the proposed technique is able to achieve better results (25% higher confidence) as compared to the existing fault-tolerant techniques.","PeriodicalId":414665,"journal":{"name":"2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114418471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}