2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing: Latest Papers

A Dependability Solution for Homogeneous MPSoCs
Pub Date : 2011-12-12 DOI: 10.1109/PRDC.2011.16
Xiao Zhang, H. Kerkhoff
Nowadays, many safety-critical applications demand highly dependable electronic devices. Dependability attributes such as reliability and availability/maintainability of a many-processor system-on-chip (MPSoC) should already be examined at the design phase. Design-for-dependability approaches, such as using available fault-free processor cores and introducing a dependability manager infrastructural IP for self-test and evaluation, can greatly enhance the dependability of an MPSoC. This is further supported by subsequent software-based repair. Design choices such as test fault coverage and test and repair time are examined to optimize the dependability attributes. Existing infrastructure such as a network-on-chip (NoC) and tile wrappers must be utilized to ensure that a test can be performed at application run-time. An example design following the proposed design-for-dependability approach is shown. The MPSoC has been processed and measurement results have validated the proposed dependability approach.
Citations: 5
RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers
Pub Date : 2011-12-12 DOI: 10.1109/PRDC.2011.20
Horst Schirmeier, J. Neuhalfen, Ingo Korb, O. Spinczyk, M. Engel
Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation. Often, neither additional circuitry to support hardware-based error detection nor downtime for performing hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them while avoiding system downtime. To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers, which is capable of efficiently detecting memory errors and which provides graceful degradation by withdrawing affected memory pages from further use. We describe the design and implementation of RAMpage and present results of an extensive qualitative as well as quantitative evaluation.
Citations: 16
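The idea in the RAMpage abstract above, testing memory while the system runs and withdrawing faulty pages from use, can be illustrated with a small simulation. This is only a sketch under stated assumptions and not the authors' implementation: the `SimulatedMemory` class, the page size, the test patterns, and the stuck-at-zero fault model are all hypothetical, and a real tester would first have to claim a page from the operating system before overwriting it.

```python
PAGE_SIZE = 64                        # bytes per simulated page (assumption)
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)   # simple write/read-back patterns (assumption)


class SimulatedMemory:
    """Byte-addressable memory with optional stuck-at-zero bits (hypothetical model)."""

    def __init__(self, num_pages, stuck_at_zero=None):
        self.cells = bytearray(num_pages * PAGE_SIZE)
        # Maps an address to a bit mask of bits that always read back as 0.
        self.stuck_at_zero = dict(stuck_at_zero or {})

    def write(self, addr, value):
        mask = self.stuck_at_zero.get(addr, 0)
        self.cells[addr] = value & ~mask & 0xFF

    def read(self, addr):
        return self.cells[addr]


def page_passes(mem, page):
    """Write each pattern to every byte of the page and read it back."""
    base = page * PAGE_SIZE
    for pattern in PATTERNS:
        for offset in range(PAGE_SIZE):
            mem.write(base + offset, pattern)
        for offset in range(PAGE_SIZE):
            if mem.read(base + offset) != pattern:
                return False
    return True


def scan_and_retire(mem, num_pages, retired):
    """Test every page still in use and add failing pages to the retired set."""
    for page in range(num_pages):
        if page not in retired and not page_passes(mem, page):
            retired.add(page)
    return retired


# One faulty bit (bit 2) at byte 5 of page 3.
memory = SimulatedMemory(num_pages=8, stuck_at_zero={3 * PAGE_SIZE + 5: 0x04})
retired_pages = scan_and_retire(memory, 8, set())
usable_pages = [p for p in range(8) if p not in retired_pages]
print(retired_pages, usable_pages)
```

In this simulation only page 3 is retired, so the remaining seven pages stay usable. That is the graceful-degradation behaviour the abstract describes, reduced to its simplest form.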
Augmenting Functional Broadside Tests for Transition Fault Coverage with Bounded Switching Activity
Pub Date : 2011-12-12 DOI: 10.1109/PRDC.2011.14
I. Pomeranz
For most purposes, it is sufficient for a low-power test set to ensure that the power dissipation during test application will not exceed that possible during functional operation. This is guaranteed for the fast functional capture cycles of functional broadside tests. This paper describes a procedure that generates broadside test sets with bounded switching activity during fast functional capture cycles based on the maximum switching activity of a functional broadside test set, targeting transition faults in full-scan circuits. The procedure first generates a compact functional broadside test set. It then extends the test set in steps in order to increase its fault coverage to that of an arbitrary broadside test set (a test set that includes non-functional broadside tests). During these steps, the maximum switching activity of the functional broadside test set is used for bounding the switching activity.
Citations: 14
Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers
Pub Date : 2011-07-01 DOI: 10.1109/PRDC.2011.26
Fabian Oboril, M. Tahoori, V. Heuveline, D. Lukarski, Jan-Philipp Weiss
As hardware devices like processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault-tolerant numerical computing engines, as used in many critical applications with long computation/mission times, is becoming pronounced. In this paper, we present an Algorithm-based Fault Tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugate Gradient (CG) method that takes advantage of numerical defect correction. This method is "pay as you go", meaning that there is practically a runtime overhead only if errors occur and a correction is performed. Our experimental comparison with software-based Triple Modular Redundancy (TMR) clearly shows the runtime benefit of the proposed approach, good fault tolerance, and no occurrence of silent data corruption.
Citations: 27
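To make the "pay as you go" idea in the abstract above more concrete, here is a minimal pure-Python sketch of a Conjugate Gradient loop that occasionally compares its recurrence residual with the true residual b - Ax and restarts from the true residual when the two disagree. It is an illustration under stated assumptions, not the scheme from the paper: the check interval, the tolerances, the restart strategy, and the `fault_at` injection hook are all hypothetical.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def true_residual(A, b, x):
    return [bi - axi for bi, axi in zip(b, matvec(A, x))]


def cg_with_residual_check(A, b, tol=1e-10, check_tol=1e-8,
                           check_every=2, max_iter=200, fault_at=None):
    """CG for a symmetric positive definite A, with a periodic residual check.

    fault_at (hypothetical hook) corrupts x once at that iteration index.
    Returns the solution estimate and the number of corrective restarts.
    """
    x = [0.0] * len(b)
    r = list(b)          # recurrence residual; equals b - A x for x = 0
    p = list(r)
    rs = dot(r, r)
    restarts = 0
    for k in range(max_iter):
        converged = rs ** 0.5 < tol
        if converged or (k > 0 and k % check_every == 0):
            # Extra matrix-vector product: the only overhead when no fault occurs.
            r_true = true_residual(A, b, x)
            gap = max(abs(t - ri) for t, ri in zip(r_true, r))
            if gap > check_tol:
                # The recurrence no longer matches x: restart from the true residual.
                r, p = r_true, list(r_true)
                rs = dot(r, r)
                restarts += 1
                converged = rs ** 0.5 < tol
        if converged:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        if k == fault_at:
            x[0] += 1.0  # simulated silent corruption of the iterate
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, restarts


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_clean, restarts_clean = cg_with_residual_check(A, b)
x_faulty, restarts_faulty = cg_with_residual_check(A, b, fault_at=1)
print(restarts_clean, restarts_faulty)
```

The fault-free run pays only for the occasional extra residual computation, while the corrupted run is detected at the next check and continues from the corrected residual. Because convergence is also confirmed against the true residual, a corrupted iterate is not silently accepted as the answer in this sketch.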
A Self-Stabilizing Synchronization Protocol for Arbitrary Digraphs: A Self-Stabilizing Distributed Clock Synchronization Protocol For Arbitrary Digraphs
Pub Date : 2011-02-01 DOI: 10.1109/PRDC.2011.37
M. Malekpour
This paper presents a self-stabilizing distributed clock synchronization protocol in the absence of faults in the system. It is focused on the distributed clock synchronization of an arbitrary, non-partitioned digraph ranging from fully connected to 1-connected networks of nodes while allowing for differences in the network elements. This protocol does not rely on assumptions about the initial state of the system, other than the presence of at least one node, and no central clock or centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. There is no theoretical limit on the maximum number of participating nodes. The only constraint on the behavior of a node is that its interactions with other nodes are restricted to defined links and interfaces. This protocol deterministically converges within a time bound that is a linear function of the self-stabilization period. We present an outline of a deductive proof of the correctness of the protocol. A bounded model of the protocol was mechanically verified for a variety of topologies. Results of the mechanical proof of the correctness of the protocol are provided. The model checking results have verified the correctness of the protocol for networks with unidirectional and bidirectional links. In addition, the results confirm the claims of determinism and linear convergence. As a result, we conjecture that the protocol solves the general case of this problem. We also present several variations of the protocol and argue that this synchronization protocol is indeed an emergent system.
Citations: 3